* Slow ceph fs performance
@ 2012-09-26 14:50 Bryan K. Wright
2012-09-26 15:26 ` Mark Nelson
0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-09-26 14:50 UTC (permalink / raw)
To: ceph-devel
Hi folks,
I'm seeing reasonable performance when I run rados
benchmarks, but really slow I/O when reading or writing
from a mounted ceph filesystem. The rados benchmarks
show about 150 MB/s for both read and write, but when I
go to a client machine with a mounted ceph filesystem
and try to rsync a large (60 GB) directory tree onto
the ceph fs, I'm getting rates of only 2-5 MB/s.
The OSDs and MDSs are all running 64-bit CentOS 6.3
with the stock CentOS 2.6.32 kernel. The client is also
64-bit CentOS 6.3, but it's running the "elrepo" 3.5.4 kernel.
There are four OSDs, each with a hardware RAID 5 array
and an SSD for the OSD journal. The primary network
is a gigabit network, and the OSD, MDS and MON
machines have a dedicated backend gigabit network on a
second network interface.
Locally on the OSD, "hdparm -t -T" reports read rates
of ~350 MB/s, and bonnie++ shows:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
osd-local    23800M  1037  99 316048 92 131023 19  2272  98 312781 21 521.0  24
Latency             13103us     183ms     123ms   15316us     100ms   75899us
Version  1.96       ------Sequential Create------ --------Random Create--------
osd-local           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 16817  55 +++++ +++ 28786  77 23890  78 +++++ +++ 27128  75
Latency             21549us     105us     134us     902us      12us     104us
While rsyncing the files, the ceph logs show lots
of warnings of the form:
[WRN] : slow request 91.848407 seconds old, received at 2012-09-26 09:30:52.252449: osd_op(client.5310.1:56400 1000026eda0.00001ec8 [write 2093056~4096] 0.aa047db8 snapc 1=[]) currently waiting for sub ops
Snooping on traffic with wireshark shows bursts of
activity separated by long periods (30-60 sec) of idle time.
My first thought was that I was seeing a kind of
"bufferbloat". The SSDs are 120 GB, so they could easily contain
enough data to take a long time to dump. I changed to using a
journal file, limited to 1 GB, but I still see the same slow
behavior.
Any advice about how to go about debugging this would
be appreciated.
Thanks,
Bryan
--
========================================================================
Bryan Wright |"If you take cranberries and stew them like
Physics Department | applesauce, they taste much more like prunes
University of Virginia | than rhubarb does." -- Groucho
Charlottesville, VA 22901|
(434) 924-7218 | bryan@virginia.edu
========================================================================
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance
2012-09-26 14:50 Slow ceph fs performance Bryan K. Wright
@ 2012-09-26 15:26 ` Mark Nelson
2012-09-26 20:54 ` Bryan K. Wright
0 siblings, 1 reply; 23+ messages in thread
From: Mark Nelson @ 2012-09-26 15:26 UTC (permalink / raw)
To: bryan; +Cc: Bryan K. Wright, ceph-devel
On 09/26/2012 09:50 AM, Bryan K. Wright wrote:
> Hi folks,

Hi Bryan!

> I'm seeing reasonable performance when I run rados
> benchmarks, but really slow I/O when reading or writing
> from a mounted ceph filesystem. The rados benchmarks
> show about 150 MB/s for both read and write, but when I
> go to a client machine with a mounted ceph filesystem
> and try to rsync a large (60 GB) directory tree onto
> the ceph fs, I'm getting rates of only 2-5 MB/s.

Was the rados benchmark run from the same client machine that the
filesystem is being mounted on? Also, what object size did you use for
rados bench? Does the directory tree have a lot of small files or a few
very large ones?

> The OSDs and MDSs are all running 64-bit CentOS 6.3
> with the stock CentOS 2.6.32 kernel. The client is also
> 64-bit CentOS 6.3, but it's running the "elrepo" 3.5.4 kernel.
> There are four OSDs, each with a hardware RAID 5 array
> and an SSD for the OSD journal. The primary network
> is a gigabit network, and the OSD, MDS and MON
> machines have a dedicated backend gigabit network on a
> second network interface.
>
> Locally on the OSD, "hdparm -t -T" reports read rates
> of ~350 MB/s, and bonnie++ shows:
>
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> osd-local    23800M  1037  99 316048 92 131023 19  2272  98 312781 21 521.0  24
> Latency             13103us     183ms     123ms   15316us     100ms   75899us
> Version  1.96       ------Sequential Create------ --------Random Create--------
> osd-local           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 16817  55 +++++ +++ 28786  77 23890  78 +++++ +++ 27128  75
> Latency             21549us     105us     134us     902us      12us     104us
>
> While rsyncing the files, the ceph logs show lots
> of warnings of the form:
>
> [WRN] : slow request 91.848407 seconds old, received at 2012-09-26 09:30:52.252449: osd_op(client.5310.1:56400 1000026eda0.00001ec8 [write 2093056~4096] 0.aa047db8 snapc 1=[]) currently waiting for sub ops
>
> Snooping on traffic with wireshark shows bursts of
> activity separated by long periods (30-60 sec) of idle time.

My guess here is that if there is a lot of small IO happening, your SSD
journal is handling it well and probably writing data really quickly,
while your spinning disk raid5 probably can't sustain anywhere near the
required IOPs to keep up. So you get a burst of network traffic and the
journal writes it to the SSD quickly until it is filled up, then the OSD
stalls while it waits for the raid5 to write data out. Whenever the
journal flushes, a new burst of traffic comes in and the process repeats.

> My first thought was that I was seeing a kind of
> "bufferbloat". The SSDs are 120 GB, so they could easily contain
> enough data to take a long time to dump. I changed to using a
> journal file, limited to 1 GB, but I still see the same slow
> behavior.
>
> Any advice about how to go about debugging this would
> be appreciated.

It'd probably be useful to look at the write sizes going to disk.
Increasing debugging levels in the Ceph logs will give you that, but it
can be a lot to parse. You can also use something like iostat or
collectl to see what the per-second average write sizes are.

> Thanks,
> Bryan

Mark
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance
2012-09-26 15:26 ` Mark Nelson
@ 2012-09-26 20:54 ` Bryan K. Wright
2012-09-27 15:16 ` Bryan K. Wright
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Bryan K. Wright @ 2012-09-26 20:54 UTC (permalink / raw)
To: Mark Nelson; +Cc: ceph-devel
Hi Mark,

Thanks for your help. Some answers to your questions
are below.

mark.nelson@inktank.com said:
> On 09/26/2012 09:50 AM, Bryan K. Wright wrote:
> Hi folks,
> Hi Bryan!
>
> I'm seeing reasonable performance when I run rados
> benchmarks, but really slow I/O when reading or writing
> from a mounted ceph filesystem. The rados benchmarks
> show about 150 MB/s for both read and write, but when I
> go to a client machine with a mounted ceph filesystem
> and try to rsync a large (60 GB) directory tree onto
> the ceph fs, I'm getting rates of only 2-5 MB/s.
> Was the rados benchmark run from the same client machine that the filesystem
> is being mounted on? Also, what object size did you use for rados bench?
> Does the directory tree have a lot of small files or a few very large ones?

The rados benchmark was run on one of the OSD
machines. Read and write results looked like this (the
object size was just the default, which seems to be 4kB):

# rados bench -p pbench 900 write
Total time run:         900.549729
Total writes made:      33819
Write size:             4194304
Bandwidth (MB/sec):     150.215
Stddev Bandwidth:       16.2592
Max bandwidth (MB/sec): 212
Min bandwidth (MB/sec): 84
Average Latency:        0.426028
Stddev Latency:         0.24688
Max latency:            1.59936
Min latency:            0.06794

# rados bench -p pbench 900 seq
Total time run:         900.572788
Total reads made:       33676
Read size:              4194304
Bandwidth (MB/sec):     149.576
Average Latency:        0.427844
Max latency:            1.48576
Min latency:            0.015371

Regarding the rsync test, yes, the directory tree
was mostly small files.

> The OSDs and MDSs are all running 64-bit CentOS 6.3
> with the stock CentOS 2.6.32 kernel. The client is also
> 64-bit CentOS 6.3, but it's running the "elrepo" 3.5.4 kernel.
> There are four OSDs, each with a hardware RAID 5 array
> and an SSD for the OSD journal. The primary network
> is a gigabit network, and the OSD, MDS and MON
> machines have a dedicated backend gigabit network on a
> second network interface.
>
> Locally on the OSD, "hdparm -t -T" reports read rates
> of ~350 MB/s, and bonnie++ shows:
>
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> osd-local    23800M  1037  99 316048 92 131023 19  2272  98 312781 21 521.0  24
> Latency             13103us     183ms     123ms   15316us     100ms   75899us
> Version  1.96       ------Sequential Create------ --------Random Create--------
> osd-local           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 16817  55 +++++ +++ 28786  77 23890  78 +++++ +++ 27128  75
> Latency             21549us     105us     134us     902us      12us     104us
>
> While rsyncing the files, the ceph logs show lots
> of warnings of the form:
>
> [WRN] : slow request 91.848407 seconds old, received at 2012-09-26
> 09:30:52.252449: osd_op(client.5310.1:56400 1000026eda0.00001ec8 [write
> 2093056~4096] 0.aa047db8 snapc 1=[]) currently waiting for sub ops
>
> Snooping on traffic with wireshark shows bursts of
> activity separated by long periods (30-60 sec) of idle time.
>
> My guess here is that if there is a lot of small IO happening, your SSD
> journal is handling it well and probably writing data really quickly, while
> your spinning disk raid5 probably can't sustain anywhere near the required
> IOPs to keep up. So you get a burst of network traffic and the journal
> writes it to the SSD quickly until it is filled up, then the OSD stalls while
> it waits for the raid5 to write data out. Whenever the journal flushes, a
> new burst of traffic comes in and the process repeats.

That sure sounds reasonable. Maybe I can play some
more with the journal size and location to see how it affects
the speed and burstiness.

> My first thought was that I was seeing a kind of
> "bufferbloat". The SSDs are 120 GB, so they could easily contain
> enough data to take a long time to dump. I changed to using a
> journal file, limited to 1 GB, but I still see the same slow
> behavior.
>
> Any advice about how to go about debugging this would
> be appreciated.
> It'd probably be useful to look at the write sizes going to disk. Increasing
> debugging levels in the Ceph logs will give you that, but it can be a lot to
> parse. You can also use something like iostat or collectl to see what the
> per-second average write sizes are.

I'll see what I can find out. Here's a quick output from
iostat (on one of the OSD hosts) while an rsync was running:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.23    0.00    0.20    0.21    0.00   99.36

Device:          tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdm             0.96         5.82        19.94    4523588   15495690
sdn             9.96         1.51      1080.91    1174143  839900311
sdb             0.00         0.00         0.00       2248          0
sdc             0.00         0.00         0.00       2248          0
sde             0.00         0.00         0.00       2248          0
sda             0.00         0.00         0.00       2248          0
sdf             0.00         0.00         0.00       2248          0
sdi             0.00         0.00         0.00       2248          0
sdl             0.00         0.00         0.00       2248          0
sdg             0.00         0.00         0.00       2248          0
sdj             0.00         0.00         0.00       2248          0
sdh             0.00         0.00         0.00       2248          0
sdd             0.00         0.00         0.00       2248          0
sdk             0.00         0.00         0.00       2248          0
dm-0            0.00         0.00         0.00       2616          0
dm-1            2.14         5.81        19.80    4512994   15387832
sdo            96.83       305.85      3156.74  237658672 2452896474
dm-2            0.00         0.00         0.00        800         48

The relevant lines are "sdo", which is the RAID array where the
object store lives, and "sdn", which is the journal SSD.

> Thanks,
> Bryan
>
> Mark
--
========================================================================
Bryan Wright |"If you take cranberries and stew them like
Physics Department | applesauce, they taste much more like prunes
University of Virginia | than rhubarb does." -- Groucho
Charlottesville, VA 22901|
(434) 924-7218 | bryan@virginia.edu
========================================================================
^ permalink raw reply [flat|nested] 23+ messages in thread
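Mark asked about the average write sizes reaching the disks. As a rough cross-check, the iostat columns quoted above can be turned into a mean request size per device. This is only a sketch under two assumptions: that iostat's Blk_* columns count 512-byte blocks, and that dividing by tps (which counts reads and writes together) is a fair approximation. If the output was iostat's first report, the figures are averages since boot rather than for the rsync alone. The function name is illustrative.

```python
# Mean I/O request size per device, from the iostat columns quoted in the message.
# Assumption: Blk_* columns are in 512-byte blocks; tps counts reads and writes together.
BLOCK_BYTES = 512


def avg_request_kib(tps, blk_read_s, blk_wrtn_s):
    """Return the mean size of one I/O request in KiB (0.0 for an idle device)."""
    if tps == 0:
        return 0.0
    return (blk_read_s + blk_wrtn_s) * BLOCK_BYTES / tps / 1024


# (tps, Blk_read/s, Blk_wrtn/s) copied from the iostat output.
devices = {
    "sdo (RAID array)": (96.83, 305.85, 3156.74),
    "sdn (journal SSD)": (9.96, 1.51, 1080.91),
}

for name, columns in devices.items():
    print(f"{name}: {avg_request_kib(*columns):.1f} KiB per request")
```

On these numbers the RAID array is seeing requests of under 20 KiB on average, which fits Mark's guess that small I/O is reaching the spinning disks.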
* Re: Slow ceph fs performance
2012-09-26 20:54 ` Bryan K. Wright
@ 2012-09-27 15:16 ` Bryan K. Wright
2012-09-27 18:04 ` Gregory Farnum
2012-09-27 23:40 ` Mark Kirkwood
2 siblings, 0 replies; 23+ messages in thread
From: Bryan K. Wright @ 2012-09-27 15:16 UTC (permalink / raw)
To: ceph-devel
Hi folks,

I'm still struggling to get decent performance out of
cephfs. I've played around with journal size and location, but
I/O rates to the mounted ceph filesystem always hover in the
range of 2-6 MB/sec while rsyncing a large directory tree onto
the ceph fs. In contrast, using rsync over ssh to copy the same
tree onto the same RAID array on one of the OSDs gives a rate
of about 34 MB/sec.

Here's a time/sequence plot from wireshark showing what
the traffic looks like from the client's perspective while
rsyncing onto the ceph fs:

http://ayesha.phys.virginia.edu/~bryan/time-sequence-ceph-2.png

As you can see, most of the time is spent in long waits
between bursts of packets. Using a small journal file instead of
a whole SSD seems to slightly reduce the delays, but not by much.
What other tunable parameters should I be trying?

Looking at outgoing network rates on the client with
iptraf, I see the following while rsyncing over ssh:

Rate: ~300Mb/s, ~8k packets/s --> ~40kb/packet

While rsyncing to the ceph fs, I see:

Rate: ~50Mb/s, ~1k packets/s --> ~50kb/packet

(i.e., the average packet size is about the same, but about
eight times fewer packets are being sent per unit time.)

Looking at ops in flight on one of the OSDs, using
"ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight",
I see:

{ "num_ops": 3,
  "ops": [
    { "description": "pg_log(0.8 epoch 12 query_epoch 12)",
      "received_at": "2012-09-27 10:54:08.070493",
      "age": "66.673834",
      "flag_point": "delayed"},
    { "description": "pg_log(1.7 epoch 12 query_epoch 12)",
      "received_at": "2012-09-27 10:54:08.070715",
      "age": "66.673612",
      "flag_point": "delayed"},
    { "description": "pg_log(2.6 epoch 12 query_epoch 12)",
      "received_at": "2012-09-27 10:54:08.070750",
      "age": "66.673577",
      "flag_point": "delayed"}]}

Thanks for any advice.

Bryan

bkw1a@ayesha.phys.virginia.edu said:
> Hi folks,
> I'm seeing reasonable performance when I run rados benchmarks, but really
> slow I/O when reading or writing from a mounted ceph filesystem. The rados
> benchmarks show about 150 MB/s for both read and write, but when I go to a
> client machine with a mounted ceph filesystem and try to rsync a large (60 GB)
> directory tree onto the ceph fs, I'm getting rates of only 2-5 MB/s.
> The OSDs and MDSs are all running 64-bit CentOS 6.3 with the stock CentOS
> 2.6.32 kernel. The client is also 64-bit CentOS 6.3, but it's running the
> "elrepo" 3.5.4 kernel. There are four OSDs, each with a hardware RAID 5 array
> and an SSD for the OSD journal. The primary network is a gigabit network, and
> the OSD, MDS and MON machines have a dedicated backend gigabit network on a
> second network interface.
> Locally on the OSD, "hdparm -t -T" reports read rates of ~350 MB/s, and
> bonnie++ shows:
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> osd-local    23800M  1037  99 316048 92 131023 19  2272  98 312781 21 521.0  24
> Latency             13103us     183ms     123ms   15316us     100ms   75899us
> Version  1.96       ------Sequential Create------ --------Random Create--------
> osd-local           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 16817  55 +++++ +++ 28786  77 23890  78 +++++ +++ 27128  75
> Latency             21549us     105us     134us     902us      12us     104us
> While rsyncing the files, the ceph logs show lots of warnings of the form:
> [WRN] : slow request 91.848407 seconds old, received at 2012-09-26
> 09:30:52.252449: osd_op(client.5310.1:56400 1000026eda0.00001ec8 [write
> 2093056~4096] 0.aa047db8 snapc 1=[]) currently waiting for sub ops
> Snooping on traffic with wireshark shows bursts of activity separated by
> long periods (30-60 sec) of idle time.
> My first thought was that I was seeing a kind of "bufferbloat". The SSDs are
> 120 GB, so they could easily contain enough data to take a long time to dump.
> I changed to using a journal file, limited to 1 GB, but I still see the same
> slow behavior.
> Any advice about how to go about debugging this would be appreciated.
> Thanks,
> Bryan
^ permalink raw reply [flat|nested] 23+ messages in thread
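The iptraf figures in this message can be checked against the rsync rates with a little arithmetic. The sketch below assumes decimal units (1 Mb = 10^6 bits, 1 MB = 10^6 bytes) and treats the "~" figures as exact; the function names are illustrative only.

```python
# Unit conversions for the iptraf figures quoted in the message.
# Assumption: decimal units throughout (1 Mb = 1e6 bits, 1 MB = 1e6 bytes).

def mbyte_per_s(bits_per_s):
    """Convert a line rate in bits/s to megabytes/s."""
    return bits_per_s / 8 / 1e6


def kbit_per_packet(bits_per_s, packets_per_s):
    """Average packet size in kilobits, as iptraf-style rate / packet rate."""
    return bits_per_s / packets_per_s / 1000


ssh_rate, ssh_pps = 300e6, 8e3    # rsync over ssh: ~300 Mb/s, ~8k packets/s
ceph_rate, ceph_pps = 50e6, 1e3   # rsync onto cephfs: ~50 Mb/s, ~1k packets/s

print(f"ssh:    {mbyte_per_s(ssh_rate):.2f} MB/s, {kbit_per_packet(ssh_rate, ssh_pps):.1f} kb/packet")
print(f"cephfs: {mbyte_per_s(ceph_rate):.2f} MB/s, {kbit_per_packet(ceph_rate, ceph_pps):.1f} kb/packet")
print(f"packet-rate ratio: {ssh_pps / ceph_pps:.0f}x")
```

The converted rates (about 37.5 MB/s and 6.25 MB/s) are in line with the 34 MB/sec and 2-6 MB/sec rsync figures reported above, so the network counters and the rsync rates tell the same story.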
* Re: Slow ceph fs performance
2012-09-26 20:54 ` Bryan K. Wright
2012-09-27 15:16 ` Bryan K. Wright
@ 2012-09-27 18:04 ` Gregory Farnum
2012-09-27 18:47 ` Bryan K. Wright
2012-10-01 16:47 ` Tommi Virtanen
2012-09-27 23:40 ` Mark Kirkwood
2 siblings, 2 replies; 23+ messages in thread
From: Gregory Farnum @ 2012-09-27 18:04 UTC (permalink / raw)
To: bryan; +Cc: Mark Nelson, ceph-devel
On Wed, Sep 26, 2012 at 1:54 PM, Bryan K. Wright
<bkw1a@ayesha.phys.virginia.edu> wrote:
> Hi Mark,
>
> Thanks for your help. Some answers to your questions
> are below.
>
> mark.nelson@inktank.com said:
>> On 09/26/2012 09:50 AM, Bryan K. Wright wrote:
>> Hi folks,
>> Hi Bryan!
>>
>> I'm seeing reasonable performance when I run rados
>> benchmarks, but really slow I/O when reading or writing
>> from a mounted ceph filesystem. The rados benchmarks
>> show about 150 MB/s for both read and write, but when I
>> go to a client machine with a mounted ceph filesystem
>> and try to rsync a large (60 GB) directory tree onto
>> the ceph fs, I'm getting rates of only 2-5 MB/s.
>> Was the rados benchmark run from the same client machine that the filesystem
>> is being mounted on? Also, what object size did you use for rados bench?
>> Does the directory tree have a lot of small files or a few very large ones?
>
> The rados benchmark was run on one of the OSD
> machines. Read and write results looked like this (the
> object size was just the default, which seems to be 4kB):

Actually, that's 4MB. ;) Can you run

# rados bench -p pbench 900 write -t 256 -b 4096

and see what that gets? It'll run 256 simultaneous 4KB writes. (You can
also vary the number of simultaneous writes and see if that impacts it.)

However, my suspicion is that you're limited by metadata throughput
here. How large are your files? There might be some MDS or client
tunables we can adjust, but rsync's workload is a known weak spot for
CephFS.
-Greg
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance
2012-09-27 18:04 ` Gregory Farnum
@ 2012-09-27 18:47 ` Bryan K. Wright
2012-09-27 19:47 ` Gregory Farnum
2012-10-01 16:47 ` Tommi Virtanen
1 sibling, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-09-27 18:47 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel
greg@inktank.com said:
> > The rados benchmark was run on one of the OSD
> > machines. Read and write results looked like this (the
> > object size was just the default, which seems to be 4kB):
> Actually, that's 4MB. ;)

Oops! My plea is that I was the victim of a
man page bug:

    bench seconds mode [ -b objsize ] [ -t threads ]
        Benchmark for seconds. The mode can be write or read. The
        default object size is 4 KB, and the default number of simulated
        threads (parallel writes) is 16.

> Can you run # rados bench -p pbench 900 write -t 256
> -b 4096 and see what that gets? It'll run 256 simultaneous 4KB writes. (You
> can also vary the number of simultaneous writes and see if that impacts it.)

Here's the new benchmark output:

Total time run:         900.880070
Total writes made:      537187
Write size:             4096
Bandwidth (MB/sec):     2.329
Stddev Bandwidth:       2.57691
Max bandwidth (MB/sec): 12.6055
Min bandwidth (MB/sec): 0
Average Latency:        0.429315
Stddev Latency:         0.891734
Max latency:            19.7647
Min latency:            0.016743

> However, my suspicion is that you're limited by metadata throughput here. How
> large are your files? There might be some MDS or client tunables we can
> adjust, but rsync's workload is a known weak spot for CephFS. -Greg

The file size is generally small. Here's the distribution:

http://ayesha.phys.virginia.edu/~bryan/filesize.png

The mean is about 2.5 MB.

Bryan
--
========================================================================
Bryan Wright |"If you take cranberries and stew them like
Physics Department | applesauce, they taste much more like prunes
University of Virginia | than rhubarb does." -- Groucho
Charlottesville, VA 22901|
(434) 924-7218 | bryan@virginia.edu
========================================================================
^ permalink raw reply [flat|nested] 23+ messages in thread
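The summary lines of the two rados bench write runs are internally consistent, which can be seen with a short calculation. The sketch below recomputes the reported bandwidth from writes, size and run time (taking 1 MB as 2^20 bytes, which is what reproduces the reported figures), and then applies Little's law, throughput = operations in flight / average latency, as a sanity check. The 16 in-flight writes for the 4 MB run is an assumption taken from the default quoted from the man page; the function names are illustrative.

```python
# Cross-checking the rados bench summaries quoted in the thread.
MIB = 1024 * 1024


def bandwidth_mib_s(ops, size_bytes, seconds):
    """Bandwidth implied by the op count, object size and run time."""
    return ops * size_bytes / seconds / MIB


def littles_law_ops_per_s(in_flight, avg_latency_s):
    """Little's law: sustained throughput = concurrency / mean latency."""
    return in_flight / avg_latency_s


# 4 KB run: -t 256 -b 4096
small_bw = bandwidth_mib_s(537187, 4096, 900.880070)
small_ops = 537187 / 900.880070
small_ops_predicted = littles_law_ops_per_s(256, 0.429315)

# 4 MB run from earlier in the thread: default object size,
# and (assumed) the default of 16 in-flight writes.
large_bw = bandwidth_mib_s(33819, 4194304, 900.549729)
large_ops = 33819 / 900.549729
large_ops_predicted = littles_law_ops_per_s(16, 0.426028)

print(f"4 KB: {small_bw:.3f} MB/s, {small_ops:.1f} ops/s (Little's law: {small_ops_predicted:.1f})")
print(f"4 MB: {large_bw:.3f} MB/s, {large_ops:.2f} ops/s (Little's law: {large_ops_predicted:.2f})")
```

Both runs show nearly the same average latency (about 0.43 s), so with 4 KB objects the cluster is completing roughly 600 writes per second; the low bandwidth follows from the small object size at that operation rate.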
* Re: Slow ceph fs performance
2012-09-27 18:47 ` Bryan K. Wright
@ 2012-09-27 19:47 ` Gregory Farnum
0 siblings, 0 replies; 23+ messages in thread
From: Gregory Farnum @ 2012-09-27 19:47 UTC (permalink / raw)
To: Bryan K. Wright; +Cc: ceph-devel
On Thu, Sep 27, 2012 at 11:47 AM, Bryan K. Wright
<bkw1a@ayesha.phys.virginia.edu> wrote:
>
> greg@inktank.com said:
>> > The rados benchmark was run on one of the OSD
>> > machines. Read and write results looked like this (the
>> > object size was just the default, which seems to be 4kB):
>> Actually, that's 4MB. ;)
>
> Oops! My plea is that I was the victim of a
> man page bug:
>
>     bench seconds mode [ -b objsize ] [ -t threads ]
>         Benchmark for seconds. The mode can be write or read. The
>         default object size is 4 KB, and the default number of simulated
>         threads (parallel writes) is 16.

Whoops! I'd fix it but it's obfuscated somewhat now, so:
http://tracker.newdream.net/issues/3230

>> Can you run # rados bench -p pbench 900 write -t 256
>> -b 4096 and see what that gets? It'll run 256 simultaneous 4KB writes. (You
>> can also vary the number of simultaneous writes and see if that impacts it.)
>
> Here's the new benchmark output:
>
> Total time run:         900.880070
> Total writes made:      537187
> Write size:             4096
> Bandwidth (MB/sec):     2.329
>
> Stddev Bandwidth:       2.57691
> Max bandwidth (MB/sec): 12.6055
> Min bandwidth (MB/sec): 0
> Average Latency:        0.429315
> Stddev Latency:         0.891734
> Max latency:            19.7647
> Min latency:            0.016743

Hmm, that is significantly lower than I would have expected. Can you
check and see if you can get that number higher by increasing (or
decreasing) the number of in-flight ops? (-t param)
Given your size distribution, it could just be that your RAID arrays
aren't giving you the small random write throughput you expect.

>> However, my suspicion is that you're limited by metadata throughput here. How
>> large are your files? There might be some MDS or client tunables we can
>> adjust, but rsync's workload is a known weak spot for CephFS. -Greg
>
> The file size is generally small. Here's the distribution:
>
> http://ayesha.phys.virginia.edu/~bryan/filesize.png
>
> The mean is about 2.5 MB.

So that chart is measuring in KB?
Anyway, it might be metadata; you could see what the CPU usage on the
MDS server looks like while running the rsync.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance
2012-09-27 18:04 ` Gregory Farnum
2012-09-27 18:47 ` Bryan K. Wright
@ 2012-10-01 16:47 ` Tommi Virtanen
2012-10-01 17:00 ` Gregory Farnum
2012-10-01 17:03 ` Mark Nelson
1 sibling, 2 replies; 23+ messages in thread
From: Tommi Virtanen @ 2012-10-01 16:47 UTC (permalink / raw)
To: Gregory Farnum; +Cc: bryan, Mark Nelson, ceph-devel
On Thu, Sep 27, 2012 at 11:04 AM, Gregory Farnum <greg@inktank.com> wrote:
> However, my suspicion is that you're limited by metadata throughput
> here. How large are your files? There might be some MDS or client
> tunables we can adjust, but rsync's workload is a known weak spot for
> CephFS.

I feel like people are missing this part of Greg's message. Everyone
is so busy benchmarking RADOS small I/O, but what if it's currently
bottlenecked by all the file-level access operations that interact
with the MDS? Rsync causes a ton of those.

If you want to benchmark just the small IO, you can't compare rsync to rsync.

If you want to benchmark just the metadata part, rsync with 0-size
files might actually be an interesting workload.
^ permalink raw reply [flat|nested] 23+ messages in thread
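One possible way to set up the 0-size-file workload suggested here is to build a shadow copy of the real tree that keeps every directory and file name but no data, and then rsync that copy onto the ceph fs. The following is a minimal sketch, not something taken from the thread: the function name and the demo paths are hypothetical, and ownership, permissions and timestamps are not preserved.

```python
# Build a copy of a directory tree in which every file is empty, so that an
# rsync of the copy exercises metadata operations with no file data.
import os
import tempfile


def mirror_as_empty_files(src_root, dst_root):
    """Recreate src_root's directories and file names under dst_root as 0-byte files.

    Returns the number of empty files created.
    """
    created = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            # Opening in "w" mode creates (or truncates to) an empty file.
            with open(os.path.join(target_dir, name), "w"):
                pass
            created += 1
    return created


if __name__ == "__main__":
    # Small self-contained demonstration on throwaway directories.
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        os.makedirs(os.path.join(src, "a", "b"))
        with open(os.path.join(src, "a", "b", "data.txt"), "w") as fh:
            fh.write("payload that should not be copied")
        print(mirror_as_empty_files(src, dst), "empty file(s) created")
```

Timing an rsync of such a tree, and comparing it with the full-data rsync, would give a rough split between metadata cost and data cost.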
* Re: Slow ceph fs performance
2012-10-01 16:47 ` Tommi Virtanen
@ 2012-10-01 17:00 ` Gregory Farnum
2012-10-03 14:55 ` Bryan K. Wright
2012-10-01 17:03 ` Mark Nelson
1 sibling, 1 reply; 23+ messages in thread
From: Gregory Farnum @ 2012-10-01 17:00 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: bryan, Mark Nelson, ceph-devel
On Mon, Oct 1, 2012 at 9:47 AM, Tommi Virtanen <tv@inktank.com> wrote:
> On Thu, Sep 27, 2012 at 11:04 AM, Gregory Farnum <greg@inktank.com> wrote:
>> However, my suspicion is that you're limited by metadata throughput
>> here. How large are your files? There might be some MDS or client
>> tunables we can adjust, but rsync's workload is a known weak spot for
>> CephFS.
>
> I feel like people are missing this part of Greg's message. Everyone
> is so busy benchmarking RADOS small I/O, but what if it's currently
> bottlenecked by all the file-level access operations that interact
> with the MDS? Rsync causes a ton of those.

Yes. Bryan, you mentioned that you didn't see a lot of resource usage;
was it perhaps flatlined at (100 * 1 / num_cpus)? The MDS is
multi-threaded in theory, but in practice it has the equivalent of a Big
Kernel Lock so it's not going to get much past one cpu core of time...

The rados bench results do indicate some pretty bad small-file write
performance as well though, so I guess it's possible your testing is
running long enough that the page cache isn't absorbing that hit. Did
performance start out higher or has it been flat?

> If you want to benchmark just the small IO, you can't compare rsync to rsync.
>
> If you want to benchmark just the metadata part, rsync with 0-size
> files might actually be an interesting workload.
^ permalink raw reply [flat|nested] 23+ messages in thread
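Greg's "(100 * 1 / num_cpus)" remark can be made concrete: a process that is effectively single-threaded cannot push whole-machine CPU usage much beyond one core's share. The sketch below simply evaluates that ceiling for a few core counts and flags readings close to it. The 2-point tolerance and the function names are assumptions made for illustration, and the check only applies to a whole-machine percentage, not a per-core one.

```python
# Ceiling on whole-machine CPU usage for an effectively single-threaded daemon.

def single_thread_ceiling_percent(num_cpus):
    """Share of total CPU that one fully busy core represents."""
    return 100.0 * 1 / num_cpus


def looks_single_core_bound(total_cpu_percent, num_cpus, tolerance=2.0):
    """True if the whole-machine reading sits at (or above) one core's share.

    `tolerance` (in percentage points) is an arbitrary allowance for noise.
    """
    return total_cpu_percent >= single_thread_ceiling_percent(num_cpus) - tolerance


for cores in (1, 2, 4, 8):
    print(f"{cores} core(s): ceiling {single_thread_ceiling_percent(cores):.1f}%")

# A reading of a few percent on a four-core host is far below the 25% ceiling,
# while a reading stuck near 25% would point to a saturated single thread.
print(looks_single_core_bound(3.0, 4), looks_single_core_bound(24.5, 4))
```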
* Re: Slow ceph fs performance
2012-10-01 17:00 ` Gregory Farnum
@ 2012-10-03 14:55 ` Bryan K. Wright
2012-10-03 18:35 ` Gregory Farnum
0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-03 14:55 UTC (permalink / raw)
To: ceph-devel
Hi again,

A few answers to questions from various people on the list
after my last e-mail:

greg@inktank.com said:
> Yes. Bryan, you mentioned that you didn't see a lot of resource usage; was it
> perhaps flatlined at (100 * 1 / num_cpus)? The MDS is multi-threaded in
> theory, but in practice it has the equivalent of a Big Kernel Lock so it's not
> going to get much past one cpu core of time...

The CPU usage on the MDSs hovered around a few percent.
They're quad-core machines, and I didn't see it ever get as high
as 25% usage on any of the cores while watching with atop.

greg@inktank.com said:
> The rados bench results do indicate some pretty bad small-file write
> performance as well though, so I guess it's possible your testing is running
> long enough that the page cache isn't absorbing that hit. Did performance
> start out higher or has it been flat?

Looking at the details of the rados benchmark output, it does
look like performance starts out better for the first few iterations,
and then goes bad. Here's the beginning of a typical small-file run:

Maintaining 256 concurrent writes of 4096 bytes for at least 900 seconds.
 sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   0       0         0         0         0         0         -         0
   1     255      3683      3428   13.3894   13.3906  0.002569 0.0696906
   2     256      7561      7305   14.2661   15.1445  0.106437 0.0669534
   3     256     10408     10152   13.2173   11.1211  0.002176 0.0689543
   4     256     11256     11000    10.741    3.3125  0.002097 0.0846414
   5     256     11256     11000    8.5928         0         - 0.0846414
   6     256     11370     11114   7.23489  0.222656  0.002399 0.0962989
   7     255     12480     12225   6.82126   4.33984  0.117658  0.142335
   8     256     13289     13033   6.36311   3.15625  0.002574  0.151261
   9     256     13737     13481   5.85051      1.75  0.120657  0.158865
  10     256     14341     14085   5.50138   2.35938  0.022544  0.178298

I see the same behavior every time I repeat the small-file
rados benchmark. Here's a graph showing the first 100 "cur MB/s" values
for a short-file benchmark:

http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4096-run1-09282012-curmbps.pdf

On the other hand, with 4MB files, I see results that start out
like this:

Maintaining 256 concurrent writes of 4194304 bytes for at least 900 seconds.
 sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   0       0         0         0         0         0         -         0
   1      49        49         0         0         0         -         0
   2      76        76         0         0         0         -         0
   3     105       105         0         0         0         -         0
   4     133       133         0         0         0         -         0
   5     159       159         0         0         0         -         0
   6     188       188         0         0         0         -         0
   7     218       218         0         0         0         -         0
   8     246       246         0         0         0         -         0
   9     256       274        18   7.99904         8   8.97759   8.66218
  10     255       301        46   18.3978       112    9.1456   8.94095
  11     255       330        75   27.2695       116   9.06968     9.013
  12     255       358       103   34.3292       112   9.12486   9.04374

Here's a graph showing the first 100 "cur MB/s" values for a typical
4MB file benchmark:

http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4MB-run1-09282012-curmbps.pdf

mark.nelson@inktank.com said:
> When you were doing this, what kind of results did collectl give you for
> average write sizes to the underlying OSD disks?

The average "rwsize" reported by collectl hovered around
6 +/- a few (in whatever units collectl reports) for the RAID
array, and around 15 for the journal SSD, while doing the small-file
rados benchmark.

Here's a screenshot showing atop running on each of the MDS
hosts, and collectl running on each of the OSD hosts, while the
benchmark was running:

http://ayesha.phys.virginia.edu/~bryan/collectl-atop-t256-b4096.png

Here's the same, but with collectl running on the MDSs
instead of atop:

http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4096.png

Looking at the last screenshot again, it does look like the
disks on the MDSs are getting some exercise, with ~40% utilization
(if I'm interpreting the collectl output correctly).

Here's a similar snapshot for the 4MB test:

http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4MB.png

It looks like similar "pct util" on the MDS disks, but much
higher average rwsize values on the OSDs.

mark.nelson@inktank.com said:
> There's multiple issues potentially here. Part of it might be how writes are
> coalesced by XFS in each scenario. Part of it might also be overhead due to
> XFS metadata reads/writes. You could probably get a better idea of both of
> these by running blktrace during the tests and making seekwatcher movies of
> the results. You not only can look at the numbers of seeks, but also the
> kind (read/writes) and where on the disk they are going. That, and some of
> the raw blktrace data can give you a lot of information about what is going
> on and whether or not seeks are

I'll take a look at blktrace and see what I can find out.

mark.nelson@inktank.com said:
> Beyond that, I do think you are correct in suspecting that there are some
> Ceph limitations as well. Some things that may be interesting to try:
> - 1 OSD per Disk
> - Multiple OSDs on the RAID array.
> - Increasing various thread counts
> - Increasing various op and byte limits (such as
>   journal_max_write_entries and journal_max_write_bytes).
> - EXT4 or BTRFS under the OSDs.

And I'll give some of these a try.

Regarding the iozone benchmarks:

mark.nelson@inktank.com said:
> Do you happen to have the settings you used when you ran these tests? I
> probably don't have time to try to repeat them now, but I can at least take a
> quick look at them.
> I'm slightly confused by the labels on the graph. They can't possibly mean
> that 2^16384 KB record sizes were tested. Was that just up to 16MB records
> and 16GB files? That would make a lot more sense.

I just did something like:

    cd /mnt/tmp    (where the cephfs was mounted)
    iozone -a > /tmp/iozone.log

By default, iozone does its tests in the current working directory.
The graphs were just produced with the Generate_Graphs script that
comes with iozone. There are certainly some problems with the
axis labeling, but I think your interpretation is correct.

mark.nelson@inktank.com said:
> This might be a dumb question, but was the ceph version of this test on a
> single client on gigabit Ethernet? If so, wouldn't that be the reason you
> are maxing out at like 114MB/s?

Duh. You're exactly right. I should have noticed this.

And finally:

tv@inktank.com said:
> If you want to benchmark just the metadata part, rsync with 0-size files might
> actually be an interesting workload.

I'll see if I can work out a way to do this.

Thanks to everyone for the suggestions.

Bryan
--
========================================================================
Bryan Wright |"If you take cranberries and stew them like
Physics Department | applesauce, they taste much more like prunes
University of Virginia | than rhubarb does." -- Groucho
Charlottesville, VA 22901|
(434) 924-7218 | bryan@virginia.edu
========================================================================
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance 2012-10-03 14:55 ` Bryan K. Wright @ 2012-10-03 18:35 ` Gregory Farnum 2012-10-04 13:14 ` Bryan K. Wright 0 siblings, 1 reply; 23+ messages in thread From: Gregory Farnum @ 2012-10-03 18:35 UTC (permalink / raw) To: bryan; +Cc: ceph-devel I think I'm with Mark now — this does indeed look like too much random IO for the disks to handle. In particular, Ceph requires that each write be synced to disk before it's considered complete, which rsync definitely doesn't. In the filesystem this is generally disguised fairly well by all the caches and such in the way, but this use case is unfriendly to that arrangement. However, I am particularly struck by seeing one of your OSDs at 96% disk utilization while the others remain <50%, and I've just realized we never saw output from ceph -s. Can you provide that, please? -Greg On Wed, Oct 3, 2012 at 7:55 AM, Bryan K. Wright <bkw1a@ayesha.phys.virginia.edu> wrote: > Hi again, > > A few answers to questions from various people on the list > after my last e-mail: > > greg@inktank.com said: >> Yes. Bryan, you mentioned that you didn't see a lot of resource usage — was it >> perhaps flatlined at (100 * 1 / num_cpus)? The MDS is multi-threaded in >> theory, but in practice it has the equivalent of a Big Kernel Lock so it's not >> going to get much past one cpu core of time... > > The CPU usage on the MDSs hovered around a few percent. > They're quad-core machines, and I didn't see it ever get as high > as 25% usage on any of the cores while watching with atop. > > greg@inktank.com said: >> The rados bench results do indicate some pretty bad small-file write >> performance as well though, so I guess it's possible your testing is running >> long enough that the page cache isn't absorbing that hit. Did performance >> start out higher or has it been flat? 
> > Looking at the details of the rados benchmark output, it does > look like performance starts out better for the first few iterations, > and then goes bad. Here's the begining of a typical small-file run: > > Maintaining 256 concurrent writes of 4096 bytes for at least 900 seconds. > sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > 0 0 0 0 0 0 - 0 > 1 255 3683 3428 13.3894 13.3906 0.002569 0.0696906 > 2 256 7561 7305 14.2661 15.1445 0.106437 0.0669534 > 3 256 10408 10152 13.2173 11.1211 0.002176 0.0689543 > 4 256 11256 11000 10.741 3.3125 0.002097 0.0846414 > 5 256 11256 11000 8.5928 0 - 0.0846414 > 6 256 11370 11114 7.23489 0.222656 0.002399 0.0962989 > 7 255 12480 12225 6.82126 4.33984 0.117658 0.142335 > 8 256 13289 13033 6.36311 3.15625 0.002574 0.151261 > 9 256 13737 13481 5.85051 1.75 0.120657 0.158865 > 10 256 14341 14085 5.50138 2.35938 0.022544 0.178298 > > I see the same behavior every time I repeat the small-file > rados benchmark. Here's a graph showing the first 100 "cur MB/s" values > for a short-file benchmark: > > http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4096-run1-09282012-curmbps.pdf > > On the other hand, with 4MB files, I see results that start out like > this: > > Maintaining 256 concurrent writes of 4194304 bytes for at least 900 seconds. 
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > 0 0 0 0 0 0 - 0 > 1 49 49 0 0 0 - 0 > 2 76 76 0 0 0 - 0 > 3 105 105 0 0 0 - 0 > 4 133 133 0 0 0 - 0 > 5 159 159 0 0 0 - 0 > 6 188 188 0 0 0 - 0 > 7 218 218 0 0 0 - 0 > 8 246 246 0 0 0 - 0 > 9 256 274 18 7.99904 8 8.97759 8.66218 > 10 255 301 46 18.3978 112 9.1456 8.94095 > 11 255 330 75 27.2695 116 9.06968 9.013 > 12 255 358 103 34.3292 112 9.12486 9.04374 > > Here's a graph showing the first 100 "cur MB/s" values for a typical > 4MB file benchmark: > > http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4MB-run1-09282012-curmbps.pdf > > mark.nelson@inktank.com said: >> When you were doing this, what kind of results did collectl give you for >> average write sizes to the underlying OSD disks? > > The average "rwsize" reported by collectl hovered around > 6 +/- a few (in whatever units collectl reports) for the RAID > array, and around 15 for the journal SSD, while doing the small-file > rados benchmark. Here's a screenshot showing atop running on > each of the MDS hosts, and collectl running on each of the OSD > hosts, while the benchmark was running: > > http://ayesha.phys.virginia.edu/~bryan/collectl-atop-t256-b4096.png > > Here's the same, but with collectl running on the MDSs instead of atop: > > http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4096.png > > Looking at the last screenshot again, it does look like the disks on > the MDSs are getting some exercise, with ~40% utilization (if I'm > interpreting the collectl output correctly). > > Here's a similar snapshot for the 4MB test: > > http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4MB.png > > It looks like similar "pct util" on the MDS disks, but much higher > average rwsize values on the OSDs. > > mark.nelson@inktank.com said: >> There's multiple issues potentially here. Part of it might be how writes are >> coalesced by XFS in each scenario. 
Part of it might also be overhead due to >> XFS metadata reads/writes. You could probably get a better idea of both of >> these by running blktrace during the tests and making seekwatcher movies of >> the results. You not only can look at the numbers of seeks, but also the >> kind (read/writes) and where on the disk they are going. That, and some of >> the raw blktrace data can give you a lot of information about what is going >> on and whether or not seeks are > > I'll take a look at blktrace and see what I can find out. > > mark.nelson@inktank.com said: >> Beyond that, I do think you are correct in suspecting that there are some >> Ceph limitations as well. Some things that may be interesting to try: > >> - 1 OSD per Disk - Multiple OSDs on the RAID array. - Increasing various >> thread counts - Increasing various op and byte limits (such as >> journal_max_write_entries and journal_max_write_bytes). - EXT4 or BTRFS under >> the OSDs. > > And I'll give some of these a try. > > Regarding the iozone benchmarks: > mark.nelson@inktank.com said: >> Do you happen to have the settings you used when you ran these tests? I >> probably don't have time to try to repeat them now, but I can at least take a >> quick look at them. >> I'm slightly confused by the labels on the graph. They can't possibly mean >> that 2^16384 KB record sizes were tested. Was that just up to 16MB records >> and 16GB files? That would make a lot more sense. > > I just did something like: > > cd /mnt/tmp (where the cephfs was mounted) > iozone -a > /tmp/iozone.log > > By default, iozone does its tests in the current working directory. > The graphs were just produced with the Generate_Graphs script > that comes with iozone. There are certainly some problems with > the axis labeling, but I think your interpretation is correct. > > mark.nelson@inktank.com said: >> This might be a dumb question, but was the ceph version of this test on a >> single client on gigabit Ethernet? 
If so, wouldn't that be the reason you >> are maxing out at like 114MB/s? > > Duh. You're exactly right. I should have noticed this. > > And finally: > tv@inktank.com said: >> If you want to benchmark just the metadata part, rsync with 0-size files might >> actually be an interesting workload. > > I'll see if I can work out a way to do this. > > Thanks to everyone for the suggestions. > Bryan > -- > ======================================================================== > Bryan Wright |"If you take cranberries and stew them like > Physics Department | applesauce, they taste much more like prunes > University of Virginia | than rhubarb does." -- Groucho > Charlottesville, VA 22901| > (434) 924-7218 | bryan@virginia.edu > ======================================================================== > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
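A note on reading the small-file run quoted above: the "cur MB/s" column appears to be the number of writes finished during that second, times the 4096-byte object size, expressed in MiB/s. The sketch below assumes that interpretation (it is inferred from the numbers in the table, not from the rados source) and uses the first rows of the quoted output. The helper name cur_mb_per_s is purely illustrative.

```python
# (sec, finished) pairs copied from the first rows of the quoted
# "rados bench" output with 256 concurrent 4096-byte writes.
FINISHED = [(0, 0), (1, 3428), (2, 7305), (3, 10152), (4, 11000)]
OBJECT_SIZE = 4096      # bytes per write
MIB = 1024 * 1024


def cur_mb_per_s(samples, object_size=OBJECT_SIZE):
    """Per-interval throughput, assuming exactly one sample per second."""
    rates = []
    for (_, prev_done), (_, done) in zip(samples, samples[1:]):
        rates.append((done - prev_done) * object_size / MIB)
    return rates


if __name__ == "__main__":
    for sec, rate in enumerate(cur_mb_per_s(FINISHED), start=1):
        print(f"sec {sec}: {rate:.4f} MB/s")
```

Under that assumption the collapse after the third second is a drop in completed operations per second (from a few thousand to under a thousand), not a change in how the column is computed.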
* Re: Slow ceph fs performance
2012-10-03 18:35 ` Gregory Farnum
@ 2012-10-04 13:14 ` Bryan K. Wright
2012-10-04 15:24 ` Sage Weil
0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-04 13:14 UTC (permalink / raw)
To: ceph-devel
Hi Greg,
greg@inktank.com said:
> I think I'm with Mark now - this does indeed look like too much random IO for
> the disks to handle. In particular, Ceph requires that each write be synced to
> disk before it's considered complete, which rsync definitely doesn't. In the
> filesystem this is generally disguised fairly well by all the caches and such
> in the way, but this use case is unfriendly to that arrangement.
> However, I am particularly struck by seeing one of your OSDs at 96% disk
> utilization while the others remain <50%, and I've just realized we never saw
> output from ceph -s. Can you provide that, please?
Here's the ceph -s output:
health HEALTH_OK
monmap e1: 3 mons at {0=192.168.1.31:6789/0,1=192.168.1.32:6789/0,2=192.168.1.33:6789/0}, election epoch 2, quorum 0,1,2 0,1,2
osdmap e24: 4 osds: 4 up, 4 in
pgmap v8363: 960 pgs: 960 active+clean; 15099 MB data, 38095 MB used, 74354 GB / 74391 GB avail
mdsmap e25: 1/1/1 up {0=2=up:active}, 2 up:standby
The OSD disk utilization seems to vary a lot during these
benchmarks. My recollection is that each of the OSD hosts sometimes
sees near-100% utilization.
Bryan
--
========================================================================
Bryan Wright |"If you take cranberries and stew them like
Physics Department | applesauce, they taste much more like prunes
University of Virginia | than rhubarb does." -- Groucho
Charlottesville, VA 22901|
(434) 924-7218 | bryan@virginia.edu
========================================================================
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance
2012-10-04 13:14 ` Bryan K. Wright
@ 2012-10-04 15:24 ` Sage Weil
2012-10-04 15:54 ` Bryan K. Wright
0 siblings, 1 reply; 23+ messages in thread
From: Sage Weil @ 2012-10-04 15:24 UTC (permalink / raw)
To: bryan; +Cc: ceph-devel
On Thu, 4 Oct 2012, Bryan K. Wright wrote:
> Hi Greg,
>
> greg@inktank.com said:
> > I think I'm with Mark now - this does indeed look like too much random IO for
> > the disks to handle. In particular, Ceph requires that each write be synced to
> > disk before it's considered complete, which rsync definitely doesn't. In the
> > filesystem this is generally disguised fairly well by all the caches and such
> > in the way, but this use case is unfriendly to that arrangement.
>
> > However, I am particularly struck by seeing one of your OSDs at 96% disk
> > utilization while the others remain <50%, and I've just realized we never saw
> > output from ceph -s. Can you provide that, please?
>
> Here's the ceph -s output:
>
> health HEALTH_OK
> monmap e1: 3 mons at {0=192.168.1.31:6789/0,1=192.168.1.32:6789/0,2=192.168.1.33:6789/0}, election epoch 2, quorum 0,1,2 0,1,2
> osdmap e24: 4 osds: 4 up, 4 in
> pgmap v8363: 960 pgs: 960 active+clean; 15099 MB data, 38095 MB used, 74354 GB / 74391 GB avail
> mdsmap e25: 1/1/1 up {0=2=up:active}, 2 up:standby
>
> The OSD disk utilization seems to vary a lot during these
> benchmarks. My recollection is that each of the OSD hosts sometimes
> sees near-100% utilization.
Can you also include 'ceph osd tree', 'ceph osd dump', and 'ceph pg dump'
output? So we can make sure CRUSH is distributing things well?
Thanks!
sage
>
> Bryan
>
> --
> ========================================================================
> Bryan Wright |"If you take cranberries and stew them like
> Physics Department | applesauce, they taste much more like prunes
> University of Virginia | than rhubarb does." -- Groucho
> Charlottesville, VA 22901|
> (434) 924-7218 | bryan@virginia.edu
> ========================================================================
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance
2012-10-04 15:24 ` Sage Weil
@ 2012-10-04 15:54 ` Bryan K. Wright
2012-10-26 20:48 ` Gregory Farnum
0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-04 15:54 UTC (permalink / raw)
To: ceph-devel
Hi Sage,
sage@inktank.com said:
> Can you also include 'ceph osd tree', 'ceph osd dump', and 'ceph pg dump'
> output? So we can make sure CRUSH is distributing things well?
Here they are:

# ceph osd tree
dumped osdmap tree epoch 24
# id    weight  type name               up/down reweight
-1      4       pool default
-3      4         rack unknownrack
-2      1           host ceph-osd-1
1       1             osd.1             up      1
-4      1           host ceph-osd-2
2       1             osd.2             up      1
-5      1           host ceph-osd-3
3       1             osd.3             up      1
-6      1           host ceph-osd-4
4       1             osd.4             up      1

# ceph osd dump
dumped osdmap epoch 24
epoch 24
fsid 7e4e4302-4ced-439e-9786-49e6036dfda4
created 2012-09-28 13:17:40.774580
modifed 2012-09-28 16:56:02.864965
flags

pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0

max_osd 5
osd.1 up in weight 1 up_from 18 up_thru 21 down_at 17 last_clean_interval [10,15) 192.168.1.21:6800/3702 192.168.12.21:6800/3702 192.168.12.21:6801/3702 exists,up 4ad0b4cd-cbff-4693-b8f7-667148386cf3
osd.2 up in weight 1 up_from 17 up_thru 21 down_at 16 last_clean_interval [8,15) 192.168.1.22:6800/3428 192.168.12.22:6800/3428 192.168.12.22:6801/3428 exists,up 6a829cc6-fc60-450a-ac1d-8e148b757e57
osd.3 up in weight 1 up_from 21 up_thru 21 down_at 20 last_clean_interval [9,15) 192.168.1.23:6800/3436 192.168.12.23:6800/3436 192.168.12.23:6801/3436 exists,up 387cff7a-b857-434b-af66-0e08f56fd0f7
osd.4 up in weight 1 up_from 18 up_thru 21 down_at 17 last_clean_interval [9,15) 192.168.1.24:6800/3486 192.168.12.24:6800/3486 192.168.12.24:6801/3486 exists,up fe8c4bf0-ff6f-41e9-91ac-d5826672f8b5

# ceph pg dump
See http://ayesha.phys.virginia.edu/~bryan/ceph-pg-dump.txt

Bryan
--
========================================================================
Bryan Wright |"If you take cranberries and stew them like
Physics Department | applesauce, they taste much more like prunes
University of Virginia | than rhubarb does." -- Groucho
Charlottesville, VA 22901|
(434) 924-7218 | bryan@virginia.edu
========================================================================
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance 2012-10-04 15:54 ` Bryan K. Wright @ 2012-10-26 20:48 ` Gregory Farnum 2012-10-29 15:08 ` Bryan K. Wright 0 siblings, 1 reply; 23+ messages in thread From: Gregory Farnum @ 2012-10-26 20:48 UTC (permalink / raw) To: bryan; +Cc: ceph-devel On Thu, Oct 4, 2012 at 8:54 AM, Bryan K. Wright <bkw1a@ayesha.phys.virginia.edu> wrote: > Hi Sage, > > sage@inktank.com said: >> Can you also include 'ceph osd tree', 'ceph osd dump', and 'ceph pg dump' >> output? So we can make sure CRUSH is distributing things well? > > Here they are: > > # ceph osd tree > dumped osdmap tree epoch 24 > # id weight type name up/down reweight > -1 4 pool default > -3 4 rack unknownrack > -2 1 host ceph-osd-1 > 1 1 osd.1 up 1 > -4 1 host ceph-osd-2 > 2 1 osd.2 up 1 > -5 1 host ceph-osd-3 > 3 1 osd.3 up 1 > -6 1 host ceph-osd-4 > 4 1 osd.4 up 1 > > # ceph osd dump > dumped osdmap epoch 24 > epoch 24 > fsid 7e4e4302-4ced-439e-9786-49e6036dfda4 > created 2012-09-28 13:17:40.774580 > modifed 2012-09-28 16:56:02.864965 > flags > > pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0 crash_replay_interval 45 > pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0 > pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0 > > max_osd 5 > osd.1 up in weight 1 up_from 18 up_thru 21 down_at 17 last_clean_interval [10,15) 192.168.1.21:6800/3702 192.168.12.21:6800/3702 192.168.12.21:6801/3702 exists,up 4ad0b4cd-cbff-4693-b8f7-667148386cf3 > osd.2 up in weight 1 up_from 17 up_thru 21 down_at 16 last_clean_interval [8,15) 192.168.1.22:6800/3428 192.168.12.22:6800/3428 192.168.12.22:6801/3428 exists,up 6a829cc6-fc60-450a-ac1d-8e148b757e57 > osd.3 up in weight 1 up_from 21 up_thru 21 down_at 20 last_clean_interval [9,15) 192.168.1.23:6800/3436 192.168.12.23:6800/3436 192.168.12.23:6801/3436 exists,up 
387cff7a-b857-434b-af66-0e08f56fd0f7 > osd.4 up in weight 1 up_from 18 up_thru 21 down_at 17 last_clean_interval [9,15) 192.168.1.24:6800/3486 192.168.12.24:6800/3486 192.168.12.24:6801/3486 exists,up fe8c4bf0-ff6f-41e9-91ac-d5826672f8b5 > > # ceph pg dump > See http://ayesha.phys.virginia.edu/~bryan/ceph-pg-dump.txt Eeek, I was going through my email backlog and came across this thread again. Everything here does look good; the data distribution etc is pretty reasonable. If you're still testing, we can at least get a rough idea of the sorts of IO the OSD is doing by looking at the perfcounters out of the admin socket: ceph --admin-daemon /path/to/socket perf dump (I believe the default path is /var/run/ceph/ceph-osd.*.asok) ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance 2012-10-26 20:48 ` Gregory Farnum @ 2012-10-29 15:08 ` Bryan K. Wright 2012-11-03 17:55 ` Gregory Farnum 0 siblings, 1 reply; 23+ messages in thread From: Bryan K. Wright @ 2012-10-29 15:08 UTC (permalink / raw) To: Gregory Farnum; +Cc: bryan, ceph-devel greg@inktank.com said: > Eeek, I was going through my email backlog and came across this thread again. > Everything here does look good; the data distribution etc is pretty > reasonable. If you're still testing, we can at least get a rough idea of the > sorts of IO the OSD is doing by looking at the perfcounters out of the admin > socket: ceph --admin-daemon /path/to/socket perf dump (I believe the default > path is /var/run/ceph/ceph-osd.*.asok) Hi Greg, Thanks for your help. I've been experimenting with other things, so the cluster has a different arrangement now, but the performance seems to be about the same. I've now broken down the RAID arrays into JBOD disks, and I'm running one OSD per disk, recklessly ignoring the warning about syncfs being missing. (Performance doesn't seem any better or worse than it was before when rsyncing a large directory of small files.) I've also added another osd node into the mix, with a different disk controller. For what it's worth, here are "perf dump" outputs for a couple of OSDs running on the old and new hardware, respectively: http://ayesha.phys.virginia.edu/~bryan/perf.osd.200.txt http://ayesha.phys.virginia.edu/~bryan/perf.osd.100.txt If you could take a look at them and let me know if you see anything enlightening, I'd really appreciate it. Thanks, Bryan -- ======================================================================== Bryan Wright |"If you take cranberries and stew them like Physics Department | applesauce, they taste much more like prunes University of Virginia | than rhubarb does." 
-- Groucho Charlottesville, VA 22901| (434) 924-7218 | bryan@virginia.edu ======================================================================== ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance 2012-10-29 15:08 ` Bryan K. Wright @ 2012-11-03 17:55 ` Gregory Farnum 0 siblings, 0 replies; 23+ messages in thread From: Gregory Farnum @ 2012-11-03 17:55 UTC (permalink / raw) To: Bryan K. Wright, Samuel Just; +Cc: bryan, ceph-devel@vger.kernel.org On Mon, Oct 29, 2012 at 4:08 PM, Bryan K. Wright <bkw1a@ayesha.phys.virginia.edu> wrote: > > greg@inktank.com said: >> Eeek, I was going through my email backlog and came across this thread again. >> Everything here does look good; the data distribution etc is pretty >> reasonable. If you're still testing, we can at least get a rough idea of the >> sorts of IO the OSD is doing by looking at the perfcounters out of the admin >> socket: ceph --admin-daemon /path/to/socket perf dump (I believe the default >> path is /var/run/ceph/ceph-osd.*.asok) > > Hi Greg, > > Thanks for your help. I've been experimenting with other things, > so the cluster has a different arrangement now, but the performance > seems to be about the same. I've now broken down the RAID arrays into > JBOD disks, and I'm running one OSD per disk, recklessly ignoring > the warning about syncfs being missing. (Performance doesn't seem > any better or worse than it was before when rsyncing a large directory > of small files.) I've also added another osd node into the mix, with > a different disk controller. > > For what it's worth, here are "perf dump" outputs for a > couple of OSDs running on the old and new hardware, respectively: > > http://ayesha.phys.virginia.edu/~bryan/perf.osd.200.txt > http://ayesha.phys.virginia.edu/~bryan/perf.osd.100.txt > > If you could take a look at them and let me know if you see > anything enlightening, I'd really appreciate it. Sam, can you check these out? I notice in particular that the average "apply_latency" is 1.44 seconds — but I don't know if I have the units right on that or have parsed something else wrong. 
-Greg
^ permalink raw reply [flat|nested] 23+ messages in thread
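On Greg's question about units for "apply_latency": a minimal sketch of how an average can be pulled out of a saved "perf dump" file, assuming latency counters are reported as an object with "avgcount" and "sum" fields and that "sum" is in seconds. Both points are assumptions to confirm against the actual output. The sample JSON below is invented for illustration and is not taken from the perf.osd.*.txt files linked above.

```python
import json

# Hypothetical excerpt shaped like "perf dump" output. The section name,
# counter names and numbers are illustrative assumptions only.
SAMPLE = """
{"filestore": {"apply_latency":   {"avgcount": 1000, "sum": 1440.0},
               "journal_latency": {"avgcount": 1000, "sum": 12.5},
               "ops": 1000}}
"""


def average_latencies(perf):
    """Return {"section.counter": sum / avgcount} for avgcount/sum pairs."""
    averages = {}
    for section, counters in perf.items():
        if not isinstance(counters, dict):
            continue
        for name, value in counters.items():
            is_pair = (isinstance(value, dict)
                       and "avgcount" in value and "sum" in value)
            if is_pair and value["avgcount"] > 0:
                averages[f"{section}.{name}"] = value["sum"] / value["avgcount"]
    return averages


if __name__ == "__main__":
    for key, avg in sorted(average_latencies(json.loads(SAMPLE)).items()):
        print(f"{key}: {avg:.4f} (seconds, if 'sum' is in seconds)")
```

To use it on real data, replace SAMPLE with the contents of a file saved from the admin socket command Greg gave earlier in the thread.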
* Re: Slow ceph fs performance 2012-10-01 16:47 ` Tommi Virtanen 2012-10-01 17:00 ` Gregory Farnum @ 2012-10-01 17:03 ` Mark Nelson 1 sibling, 0 replies; 23+ messages in thread From: Mark Nelson @ 2012-10-01 17:03 UTC (permalink / raw) To: Tommi Virtanen; +Cc: Gregory Farnum, bryan, ceph-devel On 10/01/2012 11:47 AM, Tommi Virtanen wrote: > On Thu, Sep 27, 2012 at 11:04 AM, Gregory Farnum<greg@inktank.com> wrote: >> However, my suspicion is that you're limited by metadata throughput >> here. How large are your files? There might be some MDS or client >> tunables we can adjust, but rsync's workload is a known weak spot for >> CephFS. > > I feel like people are missing this part of Greg's message. Everyone > is so busy benchmarking RADOS small I/O, but what if it's currently > bottlenecked by all the file-level access operations that interact > with the MDS? Rsync causes a ton of those. > > If you want to benchmark just the small IO, you can't compare rsync to rsync. > > If you want to benchmark just the metadata part, rsync with 0-size > files might actually be an interesting workload. I guess most of the small IO testing we've seen/done has been without CephFS at all. It's entirely possible that the MDS is slowing things down with an rsync workload like this on a fresh filesystem though. Having said that, I don't like the way that our small IO performance behaves (especially over time) when doing something like RADOS Bench. It definitely seems like there is some pretty nasty underlying filesystem metadata fragmentation or something going on after a while. Mark ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance
2012-09-26 20:54 ` Bryan K. Wright
2012-09-27 15:16 ` Bryan K. Wright
2012-09-27 18:04 ` Gregory Farnum
@ 2012-09-27 23:40 ` Mark Kirkwood
2012-09-27 23:49 ` Mark Kirkwood
2 siblings, 1 reply; 23+ messages in thread
From: Mark Kirkwood @ 2012-09-27 23:40 UTC (permalink / raw)
To: bryan; +Cc: Bryan K. Wright, Mark Nelson, ceph-devel
Bryan -
Note that the default block size for the rados bench is 4MB...and
performance decreases quite dramatically with smaller block sizes (-b
option to rados bench).
On 27/09/12 08:54, Bryan K. Wright wrote:
>
> The rados benchmark was run on one of the OSD
> machines. Read and write results looked like this (the
> objects size was just the default, which seems to be 4kB):
>
> # rados bench -p pbench 900 write
> Total time run: 900.549729
> Total writes made: 33819
> Write size: 4194304
> Bandwidth (MB/sec): 150.215
>
> Stddev Bandwidth: 16.2592
> Max bandwidth (MB/sec): 212
> Min bandwidth (MB/sec): 84
> Average Latency: 0.426028
> Stddev Latency: 0.24688
> Max latency: 1.59936
> Min latency: 0.06794
>
> # rados bench -p pbench 900 seq
> Total time run: 900.572788
> Total reads made: 33676
> Read size: 4194304
> Bandwidth (MB/sec): 149.576
>
> Average Latency: 0.427844
> Max latency: 1.48576
> Min latency: 0.015371
^ permalink raw reply [flat|nested] 23+ messages in thread
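The quoted totals can be used to check the block-size point directly. The sketch below recomputes the reported bandwidth from "Total writes made", "Write size" and "Total time run", assuming MB here means 2^20 bytes. It also shows what the bandwidth would have been had those same operations been 4096 bytes each. That last figure is a hypothetical for comparison, not a measurement from the thread.

```python
MIB = 1024 * 1024


def bandwidth_mb_per_s(ops, object_bytes, seconds):
    """Aggregate bandwidth in MiB/s for `ops` transfers of `object_bytes`."""
    return ops * object_bytes / MIB / seconds


# Figures copied from the quoted "rados bench -p pbench 900 ..." runs.
write_bw = bandwidth_mb_per_s(33819, 4194304, 900.549729)
read_bw = bandwidth_mb_per_s(33676, 4194304, 900.572788)

# Hypothetical: the same number of writes, had each been only 4 KiB.
hypothetical_4k_bw = bandwidth_mb_per_s(33819, 4096, 900.549729)

if __name__ == "__main__":
    print(f"write: {write_bw:.3f} MB/s")
    print(f"read:  {read_bw:.3f} MB/s")
    print(f"same op count at 4 KiB: {hypothetical_4k_bw:.3f} MB/s")
```

The recomputed values agree with the reported 150.215 and 149.576 MB/s only when the object size is taken as 4194304 bytes, which is consistent with the default being 4MB rather than 4kB.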
* Re: Slow ceph fs performance 2012-09-27 23:40 ` Mark Kirkwood @ 2012-09-27 23:49 ` Mark Kirkwood 2012-09-28 12:22 ` mark seger 0 siblings, 1 reply; 23+ messages in thread From: Mark Kirkwood @ 2012-09-27 23:49 UTC (permalink / raw) To: bryan; +Cc: Bryan K. Wright, Mark Nelson, ceph-devel Sorry Bryan - I should have read further down the thread and noted that you have this figured out... nothing to see here! On 28/09/12 11:40, Mark Kirkwood wrote: > Bryan - > > Note that the default block size for the rados bench is 4MB...and > performance decreases quite dramatically with smaller block sizes (-b > option to rados bench). > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance
2012-09-27 23:49 ` Mark Kirkwood
@ 2012-09-28 12:22 ` mark seger
2012-10-01 15:41 ` Bryan K. Wright
0 siblings, 1 reply; 23+ messages in thread
From: mark seger @ 2012-09-28 12:22 UTC (permalink / raw)
To: ceph-devel
I realize I'm a little late to this party, but since collectl was mentioned
I thought I'd jump in. ;)
Whenever I do any file system testing I also have a copy of collectl running
in another window. Just looking at total transfer times can end up taking you
down the wrong path. What if there are long stalls and very bursty I/O? That
could be a starved resource or a network issue that has nothing to do with the
disks at all.
As for iostat, you're certainly welcome to use it, and I based the collectl
output display format on it, but I'd highly recommend using iostat -x to see
wait/service times, as those can be key to seeing what's happening. Also, if
you use collectl instead with "-sD --home" you'll basically see the output in
a top-like format, making it really easy to see what's happening. Further, if
you apply the right filter you can simply watch a single disk, line by line,
without any pesky headers in your way.
-mark
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance
2012-09-28 12:22 ` mark seger
@ 2012-10-01 15:41 ` Bryan K. Wright
2012-10-01 16:43 ` Mark Nelson
0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-01 15:41 UTC (permalink / raw)
To: ceph-devel
Hi again,
I've fiddled around a lot with journal settings, so
to make sure I'm comparing apples to apples, I went back and
systematically re-ran the benchmark tests I've been running
(and some more). A long data dump follows, but the end result
is that it does look like something fishy is going on for small
file sizes. For example, the performance difference between 4MB
and 4KB files in the rados write benchmark is a factor of 25 or
more. Here are the details, with a recap of the configuration
at the end.
I started out by remaking the underlying xfs filesystems
on the OSD hosts, and then rerunning mkcephfs. The journals
are 120 GB SSDs.
First, the rsync tests again:
* Rsync of ~60 GB directory tree (mostly small files) from ceph client
to mounted cephfs goes at about 5.2 MB/s.
* I then turned off ceph (service ceph -a stop) and did the same
rsync between the same two hosts, onto the same RAID array on
one of the OSD hosts, but using ssh this time. This time it
goes at about 37 MB/s.
This implies to me that the slowdown is somewhere in ceph, not in
the RAID array or the network connectivity.
I then remade the xfs filesystems again, re-ran mkcephfs,
restarted ceph and did some rados benchmarks.
* rados bench -p pbench 900 write -t 256 -b 4096
Total time run: 900.184096
Total writes made: 1052511
Write size: 4096
Bandwidth (MB/sec): 4.567
Stddev Bandwidth: 4.34241
Max bandwidth (MB/sec): 23.1719
Min bandwidth (MB/sec): 0
Average Latency: 0.218949
Stddev Latency: 0.566181
Max latency: 9.92952
Min latency: 0.001449

* rados bench -p pbench 900 write -t 256 (default 4MB size)
Total time run: 900.816140
Total writes made: 25263
Write size: 4194304
Bandwidth (MB/sec): 112.178
Stddev Bandwidth: 27.1239
Max bandwidth (MB/sec): 840
Min bandwidth (MB/sec): 0
Average Latency: 9.08281
Stddev Latency: 0.505372
Max latency: 9.31865
Min latency: 0.818949

I repeated each of these benchmarks three times, but saw similar
results each time (a factor of 25 or more in speed between small
and large object sizes).
Next, I stopped ceph and took a look at local RAID
performance as a function of file size using "iozone":
http://ayesha.phys.virginia.edu/~bryan/iozone-write-local-raid.pdf
Then I re-made the ceph filesystem and restarted ceph, and
used iozone on the ceph client to look at the mounted ceph filesystem:
http://ayesha.phys.virginia.edu/~bryan/iozone-write-cephfs.pdf
I'm not sure how to interpret the iozone performance numbers,
but the distribution certainly looks much less uniform across
different file and chunk sizes for the mounted ceph filesystem.
Finally, I took a look at the results of bonnie++
benchmarks for I/O directly to the RAID array, or to the mounted
ceph filesystem.
* Looking at RAID array from one of the OSD hosts:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RAID on OSD  23800M  1155  99 318264 26 132959 19  2884  99 293464 20 535.4  23
Latency              7354us   30955us     129ms    8220us     119ms   62188us
Version  1.96       ------Sequential Create------ --------Random Create--------
RAID on OSD         -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 17680  58 +++++ +++ 26994  78 24715  81 +++++ +++ 26597  78
Latency               113us     105us     153us     109us      15us      94us

* Looking at the mounted ceph filesystem from the ceph client:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
cephfs, client  16G  1101  95 114623  8  45713  2  2665  98 133537  3 882.0  14
Latency             44515us   37018us    6437ms   12747us     469ms   60004us
Version  1.96       ------Sequential Create------ --------Random Create--------
cephfs, client      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   653   3 19886   9   601   3   746   3 +++++ +++   585   2
Latency              1171ms    7467us     174ms     104ms      19us     228ms

This seems to show about a factor of 3 difference in speed between
writing to the mounted ceph filesystem and writing directly to the
RAID array.
While I was doing these, I kept an eye on the OSDs and MDSs
with collectl and atop, but I didn't see anything that looked like
an obvious problem. The MDSs didn't see very high CPU, I/O or memory
usage, for example.
Finally, to recap the configuration:

3 MDS hosts
4 OSD hosts, each with a RAID array for object storage and an SSD journal
xfs filesystems for the object stores
gigabit network on the front end, and a separate back end gigabit network for the ceph hosts.
64-bit CentOS 6.3 and ceph 0.48.2 everywhere
ceph servers running stock CentOS 2.6.32-279.9.1 kernel.
client running "elrepo" 3.5.4-1 kernel.

Bryan
--
========================================================================
Bryan Wright              |"If you take cranberries and stew them like
Physics Department        | applesauce, they taste much more like prunes
University of Virginia    | than rhubarb does." -- Groucho
Charlottesville, VA 22901 |
(434) 924-7218            | bryan@virginia.edu
========================================================================

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Slow ceph fs performance
2012-10-01 15:41 ` Bryan K. Wright
@ 2012-10-01 16:43 ` Mark Nelson
0 siblings, 0 replies; 23+ messages in thread
From: Mark Nelson @ 2012-10-01 16:43 UTC (permalink / raw)
To: bryan; +Cc: Bryan K. Wright, ceph-devel

On 10/01/2012 10:41 AM, Bryan K. Wright wrote:
> Hi again,
>

Hello!

> I've fiddled around a lot with journal settings, so
> to make sure I'm comparing apples to apples, I went back and
> systematically re-ran the benchmark tests I've been running
> (and some more). A long data dump follows, but the end result
> is that it does look like something fishy is going on for small
> file sizes. For example, performance difference between 4MB
> and 4KB files in the rados write benchmark is a factor of 25 or
> more. Here are the details, with a recap of the configuration
> at the end.
>

Probably one of the most important things to think about when dealing
with small IOs on spinning disks is how well the operating system / file
system combine small writes into larger ones. With spinning disks you
get so few iops to work with that your throughput is almost entirely
governed by seek behavior. There are many possible reasons for slow
performance, but this should always be something you keep in mind during
your tests.

> I started out by remaking the underlying xfs filesystems
> on the OSD hosts, and then rerunning mkcephfs. The journals
> are 120 GB SSDs.
>
> First, the rsync tests again:
>
> * Rsync of ~60 GB directory tree (mostly small files) from ceph client
> to mounted cephfs goes at about 5.2 MB/s.
>

When you were doing this, what kind of results did collectl give you for
average write sizes to the underlying OSD disks?

> * I then turned off ceph (service ceph -a stop) and did the same
> rsync between the same two hosts, onto the same RAID array on
> one of the OSD hosts, but using ssh this time. This time it
> goes at about 37 MB/s.
>
> This implies to me that the slowdown is somewhere in ceph, not in
> the RAID array or the network connectivity.
>

There are potentially multiple issues here. Part of it might be how
writes are coalesced by XFS in each scenario. Part of it might also be
overhead due to XFS metadata reads/writes. You could probably get a
better idea of both of these by running blktrace during the tests and
making seekwatcher movies of the results. You can look not only at the
number of seeks, but also at their kind (reads/writes) and where on the
disk they are going. That, and some of the raw blktrace data, can give
you a lot of information about what is going on and whether or not seeks
are related to metadata.

Beyond that, I do think you are correct in suspecting that there are
some Ceph limitations as well. Some things that may be interesting to
try:

- 1 OSD per Disk
- Multiple OSDs on the RAID array.
- Increasing various thread counts
- Increasing various op and byte limits (such as
  journal_max_write_entries and journal_max_write_bytes).
- EXT4 or BTRFS under the OSDs.

> I then remade the xfs filesystems again, re-ran mkcephfs,
> restarted ceph and did some rados benchmarks.
>
> * rados bench -p pbench 900 write -t 256 -b 4096
> Total time run: 900.184096
> Total writes made: 1052511
> Write size: 4096
> Bandwidth (MB/sec): 4.567
>
> Stddev Bandwidth: 4.34241
> Max bandwidth (MB/sec): 23.1719
> Min bandwidth (MB/sec): 0
> Average Latency: 0.218949
> Stddev Latency: 0.566181
> Max latency: 9.92952
> Min latency: 0.001449
>

XFS does pretty poorly with RADOS bench at small IO sizes from what I've
seen. EXT4 and BTRFS tend to do better, but probably not more than 2-3
times better.
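
As a rough sanity check on the 4 KB numbers quoted above, the sketch
below converts the rados bench totals into operations per second and
compares that with what Little's law predicts from the 256 in-flight ops
(-t 256) and the reported average latency. The per-host figure is an
assumption for illustration only: it splits the client-side rate evenly
across the four OSD hosts and ignores replication, which would add
further writes on the back end.

```python
# Figures copied from the 4 KB "rados bench" run quoted above.
total_time_s = 900.184096
total_writes = 1052511
write_size_bytes = 4096
concurrent_ops = 256      # the -t 256 argument
avg_latency_s = 0.218949
num_osd_hosts = 4         # from the configuration recap

# Completed write operations per second over the whole run.
iops = total_writes / total_time_s

# Assumption: the "MB" reported by rados bench here is 2**20 bytes.
bandwidth_mib_s = iops * write_size_bytes / 2**20

# Little's law: throughput = operations in flight / average latency.
littles_law_iops = concurrent_ops / avg_latency_s

# Naive even split across OSD hosts; ignores replication (assumption).
iops_per_osd_host = iops / num_osd_hosts

print(f"client-side IOPS:        {iops:.1f}")
print(f"implied bandwidth:       {bandwidth_mib_s:.3f} MiB/s")
print(f"Little's law estimate:   {littles_law_iops:.1f} ops/s")
print(f"naive IOPS per OSD host: {iops_per_osd_host:.1f}")
```

The run works out to roughly 1,170 small writes per second in total, or
on the order of 290 per OSD host under the even-split assumption, and
the implied bandwidth agrees with the 4.567 MB/sec that rados bench
reported. The Little's law estimate lands on nearly the same rate, so
the reported latency and throughput are at least consistent with each
other.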
>
> * rados bench -p pbench 900 write -t 256 (default 4MB size)
> Total time run: 900.816140
> Total writes made: 25263
> Write size: 4194304
> Bandwidth (MB/sec): 112.178
>
> Stddev Bandwidth: 27.1239
> Max bandwidth (MB/sec): 840
> Min bandwidth (MB/sec): 0
> Average Latency: 9.08281
> Stddev Latency: 0.505372
> Max latency: 9.31865
> Min latency: 0.818949
>

I imagine your max throughput for 4MB IOs is being limited by the
network here. You may be able to get higher aggregate performance by
running rados bench on multiple clients concurrently.

> I repeated each of these benchmarks three times, but saw
> similar results each time (a factor of 25 or more in speed between
> small and large object sizes).
>
> Next, I stopped ceph and took a look at local RAID
> performance as a function of file size using "iozone":
>
> http://ayesha.phys.virginia.edu/~bryan/iozone-write-local-raid.pdf
>
> Then I re-made the ceph filesystem and restarted ceph, and used
> iozone on the ceph client to look at the mounted ceph filesystem:
>
> http://ayesha.phys.virginia.edu/~bryan/iozone-write-cephfs.pdf
>

Do you happen to have the settings you used when you ran these tests? I
probably don't have time to try to repeat them now, but I can at least
take a quick look at them.

> I'm not sure how to interpret the iozone performance numbers,
> but the distribution certainly looks much less uniform across
> different file and chunk sizes for the mounted ceph filesystem.
>

Indeed. Some of that is to be expected just because of the increased
complexity and number of ways that things can get backed up in a
distributed system like Ceph. Having said that, the trench in the middle
of the Ceph distribution is interesting. I wouldn't mind digging into
that more.

I'm slightly confused by the labels on the graph. They can't possibly
mean that 2^16384 KB record sizes were tested. Was that just up to 16MB
records and 16GB files? That would make a lot more sense.

> Finally, I took a look at the results of bonnie++
> benchmarks for I/O directly to the RAID array, or to the
> mounted ceph filesystem.
>
> * Looking at RAID array from one of the OSD hosts:
> Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
> Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> RAID on OSD 23800M 1155 99 318264 26 132959 19 2884 99 293464 20 535.4 23
> Latency 7354us 30955us 129ms 8220us 119ms 62188us
> Version 1.96 ------Sequential Create------ --------Random Create--------
> RAID on OSD -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
> 16 17680 58 +++++ +++ 26994 78 24715 81 +++++ +++ 26597 78
> Latency 113us 105us 153us 109us 15us 94us
>
> * Looking at the mounted ceph filesystem from the ceph client:
> Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
> Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> cephfs, client 16G 1101 95 114623 8 45713 2 2665 98 133537 3 882.0 14
> Latency 44515us 37018us 6437ms 12747us 469ms 60004us
> Version 1.96 ------Sequential Create------ --------Random Create--------
> cephfs, client -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
> 16 653 3 19886 9 601 3 746 3 +++++ +++ 585 2
> Latency 1171ms 7467us 174ms 104ms 19us 228ms
>
> This seems to show about a factor of 3 difference in speed between
> writing to the mounted ceph filesystem and writing directly to the RAID
> array.

This might be a dumb question, but was the ceph version of this test on
a single client on gigabit Ethernet? If so, wouldn't that be the reason
you are maxing out at like 114MB/s?
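
To put a number on the gigabit question, here is a back-of-the-envelope
sketch of the best-case TCP payload rate on one gigabit Ethernet link,
compared with the bonnie++ block-write figure for cephfs. It assumes a
standard 1500-byte MTU, IPv4 and TCP headers without options, and that
bonnie++'s K/sec means 1024 bytes per second; none of these are stated
in the thread.

```python
LINK_BITS_PER_S = 1_000_000_000  # gigabit Ethernet line rate
MTU = 1500                       # assumed; jumbo frames would change this
IP_TCP_HEADERS = 20 + 20         # IPv4 + TCP without options (assumption)
# Per-frame bytes on the wire that carry no IP data:
# preamble + SFD (8), Ethernet header (14), FCS (4), inter-frame gap (12).
WIRE_OVERHEAD = 8 + 14 + 4 + 12

payload_bytes = MTU - IP_TCP_HEADERS   # TCP payload per frame
wire_bytes = MTU + WIRE_OVERHEAD       # link time consumed per frame

# Best-case application payload rate in bytes per second.
max_payload_bytes_s = LINK_BITS_PER_S / 8 * payload_bytes / wire_bytes

# Sequential block write on cephfs, from the bonnie++ table above.
bonnie_block_write_k_s = 114623
measured_bytes_s = bonnie_block_write_k_s * 1024  # assumes K = 1024 bytes

utilisation = measured_bytes_s / max_payload_bytes_s

print(f"best-case payload rate: {max_payload_bytes_s / 1e6:.1f} MB/s")
print(f"measured block write:   {measured_bytes_s / 1e6:.1f} MB/s")
print(f"share of the link used: {utilisation:.1%}")
```

Under these assumptions the measured sequential write rate comes out
within a couple of percent of what a single gigabit link can carry,
which is consistent with the network, rather than Ceph itself, being the
ceiling for large sequential writes from one client.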

>
> While I was doing these, I kept an eye on the OSDs and MDSs
> with collectl and atop, but I didn't see anything that looked
> like an obvious problem. The MDSs didn't see very high CPU, I/O
> or memory usage, for example.
>
> Finally, to recap the configuration:
>
> 3 MDS hosts
> 4 OSD hosts, each with a RAID array for object storage and an SSD journal
> xfs filesystems for the object stores
> gigabit network on the front end, and a separate back end gigabit network for the ceph hosts.
> 64-bit CentOS 6.3 and ceph 0.48.2 everywhere
> ceph servers running stock CentOS 2.6.32-279.9.1 kernel.
> client running "elrepo" 3.5.4-1 kernel.
>
> Bryan
>

Mark

^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2012-11-03 17:55 UTC | newest]

Thread overview: 23+ messages
2012-09-26 14:50 Slow ceph fs performance Bryan K. Wright
2012-09-26 15:26 ` Mark Nelson
2012-09-26 20:54 ` Bryan K. Wright
2012-09-27 15:16 ` Bryan K. Wright
2012-09-27 18:04 ` Gregory Farnum
2012-09-27 18:47 ` Bryan K. Wright
2012-09-27 19:47 ` Gregory Farnum
2012-10-01 16:47 ` Tommi Virtanen
2012-10-01 17:00 ` Gregory Farnum
2012-10-03 14:55 ` Bryan K. Wright
2012-10-03 18:35 ` Gregory Farnum
2012-10-04 13:14 ` Bryan K. Wright
2012-10-04 15:24 ` Sage Weil
2012-10-04 15:54 ` Bryan K. Wright
2012-10-26 20:48 ` Gregory Farnum
2012-10-29 15:08 ` Bryan K. Wright
2012-11-03 17:55 ` Gregory Farnum
2012-10-01 17:03 ` Mark Nelson
2012-09-27 23:40 ` Mark Kirkwood
2012-09-27 23:49 ` Mark Kirkwood
2012-09-28 12:22 ` mark seger
2012-10-01 15:41 ` Bryan K. Wright
2012-10-01 16:43 ` Mark Nelson