Slow ceph fs performance

* Slow ceph fs performance
@ 2012-09-26 14:50 Bryan K. Wright
  2012-09-26 15:26 ` Mark Nelson
  0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-09-26 14:50 UTC (permalink / raw)
  To: ceph-devel

Hi folks,

	I'm seeing reasonable performance when I run rados
benchmarks, but really slow I/O when reading or writing 
from a mounted ceph filesystem.  The rados benchmarks
show about 150 MB/s for both read and write, but when I
go to a client machine with a mounted ceph filesystem
and try to rsync a large (60 GB) directory tree onto
the ceph fs, I'm getting rates of only 2-5 MB/s.

	The OSDs and MDSs are all running 64-bit CentOS 6.3
with the stock CentOS 2.6.32 kernel.  The client is also
64-bit CentOS 6.3, but it's running the "elrepo" 3.5.4 kernel.
There are four OSDs, each with a hardware RAID 5 array
and an SSD for the OSD journal.  The primary network
is a gigabit network, and the OSD, MDS and MON 
machines have a dedicated backend gigabit network on a 
second network interface.

	Locally on the OSD, "hdparm -t -T" reports read rates 
of ~350 MB/s, and bonnie++ shows:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
osd-local    23800M  1037  99 316048  92 131023  19  2272  98 312781  21 521.0  24
Latency             13103us     183ms     123ms   15316us     100ms   75899us
Version  1.96       ------Sequential Create------ --------Random Create--------
osd-local           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 16817  55 +++++ +++ 28786  77 23890  78 +++++ +++ 27128  75
Latency             21549us     105us     134us     902us      12us     104us


	While rsyncing the files, the ceph logs show lots
of warnings of the form:

[WRN] : slow request 91.848407 seconds old, received at 2012-09-26 09:30:52.252449: osd_op(client.5310.1:56400 1000026eda0.00001ec8 [write 2093056~4096] 0.aa047db8 snapc 1=[]) currently waiting for sub ops
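When there are many of these warnings, it can help to pull out the age, the write extent, and the state the request is stuck in. The following is a minimal Python sketch, not part of any Ceph tooling. It assumes the `2093056~4096` field is an `offset~length` pair in bytes, which is an interpretation not stated in the message, and the function name `parse_slow_request` is made up for illustration.

```python
import re

# Sample warning copied from the log excerpt above.
LINE = (
    "[WRN] : slow request 91.848407 seconds old, received at "
    "2012-09-26 09:30:52.252449: osd_op(client.5310.1:56400 "
    "1000026eda0.00001ec8 [write 2093056~4096] 0.aa047db8 snapc 1=[]) "
    "currently waiting for sub ops"
)

# Assumption: the bracketed field is "<op> <offset>~<length>" in bytes.
SLOW_RE = re.compile(
    r"slow request (?P<age>[\d.]+) seconds old.*?"
    r"\[(?P<op>\w+) (?P<offset>\d+)~(?P<length>\d+)\].*?"
    r"currently (?P<state>.+)$"
)


def parse_slow_request(line):
    """Return a dict of fields from one slow-request warning, or None."""
    match = SLOW_RE.search(line)
    if match is None:
        return None
    return {
        "age_s": float(match.group("age")),
        "op": match.group("op"),
        "offset": int(match.group("offset")),
        "length": int(match.group("length")),
        "state": match.group("state").strip(),
    }


info = parse_slow_request(LINE)
print(info)
```

Collecting the `length` values over a whole log would show whether the stuck requests are dominated by small (here 4096-byte) writes.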

	Snooping on traffic with wireshark shows bursts of 
activity separated by long periods (30-60 sec) of idle time.

	My first thought was that I was seeing a kind of 
"bufferbloat". The SSDs are 120 GB, so they could easily contain 
enough data to take a long time to dump.  I changed to using a 
journal file, limited to 1 GB, but I still see the same slow
behavior.

	Any advice about how to go about debugging this would
be appreciated.

					Thanks,
					Bryan

-- 
========================================================================
Bryan Wright              |"If you take cranberries and stew them like 
Physics Department        | applesauce, they taste much more like prunes
University of Virginia    | than rhubarb does."  --  Groucho 
Charlottesville, VA  22901|			
(434) 924-7218            |         bryan@virginia.edu
========================================================================



* Re: Slow ceph fs performance
  2012-09-26 14:50 Slow ceph fs performance Bryan K. Wright
@ 2012-09-26 15:26 ` Mark Nelson
  2012-09-26 20:54   ` Bryan K. Wright
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Nelson @ 2012-09-26 15:26 UTC (permalink / raw)
  To: bryan; +Cc: Bryan K. Wright, ceph-devel

On 09/26/2012 09:50 AM, Bryan K. Wright wrote:
> Hi folks,

Hi Bryan!

>
> 	I'm seeing reasonable performance when I run rados
> benchmarks, but really slow I/O when reading or writing
> from a mounted ceph filesystem.  The rados benchmarks
> show about 150 MB/s for both read and write, but when I
> go to a client machine with a mounted ceph filesystem
> and try to rsync a large (60 GB) directory tree onto
> the ceph fs, I'm getting rates of only 2-5 MB/s.

Was the rados benchmark run from the same client machine that the 
filesystem is being mounted on?  Also, what object size did you use for 
rados bench?  Does the directory tree have a lot of small files or a few 
very large ones?

>
> 	The OSDs and MDSs are all running 64-bit CentOS 6.3
> with the stock CentOS 2.6.32 kernel.  The client is also
> 64-bit CentOS 6.3, but it's running the "elrepo" 3.5.4 kernel.
> There are four OSDs, each with a hardware RAID 5 array
> and an SSD for the OSD journal.  The primary network
> is a gigabit network, and the OSD, MDS and MON
> machines have a dedicated backend gigabit network on a
> second network interface.
>
> 	Locally on the OSD, "hdparm -t -T" reports read rates
> of ~350 MB/s, and bonnie++ shows:
>
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> osd-local    23800M  1037  99 316048  92 131023  19  2272  98 312781  21 521.0  24
> Latency             13103us     183ms     123ms   15316us     100ms   75899us
> Version  1.96       ------Sequential Create------ --------Random Create--------
> osd-local           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>                files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                   16 16817  55 +++++ +++ 28786  77 23890  78 +++++ +++ 27128  75
> Latency             21549us     105us     134us     902us      12us     104us
>
>
> 	While rsyncing the files, the ceph logs show lots
> of warnings of the form:
>
> [WRN] : slow request 91.848407 seconds old, received at 2012-09-26 09:30:52.252449: osd_op(client.5310.1:56400 1000026eda0.00001ec8 [write 2093056~4096] 0.aa047db8 snapc 1=[]) currently waiting for sub ops
>
> 	Snooping on traffic with wireshark shows bursts of
> activity separated by long periods (30-60 sec) of idle time.
>

My guess here is that if there is a lot of small IO happening, your SSD 
journal is handling it well and probably writing data really quickly, 
while your spinning disk raid5 probably can't sustain anywhere near the 
required IOPs to keep up.  So you get a burst of network traffic and the 
journal writes it to the SSD quickly until it is filled up, then the OSD 
stalls while it waits for the raid5 to write data out.  Whenever the 
journal flushes, a new burst of traffic comes in and the process repeats.
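The fill-then-stall idea can be illustrated with a deliberately crude model. This Python sketch is a toy, not a description of how the Ceph journal is implemented: every number in it (1000 MB journal, 100 MB/s arriving, 3 MB/s draining to the backing store) is a hypothetical value chosen only to show the shape of the effect, and `simulate_journal` is an invented name.

```python
def simulate_journal(capacity_mb, ingest_mb_s, drain_mb_s, seconds):
    """Toy model: return the MB accepted by the journal in each second.

    Each second the backing store first drains up to drain_mb_s from the
    journal, then the journal accepts as much incoming data as fits.
    """
    level = 0.0
    accepted = []
    for _ in range(seconds):
        level = max(0.0, level - drain_mb_s)   # slow disk empties a little
        room = capacity_mb - level             # space left in the journal
        take = min(ingest_mb_s, room)          # clients can only fill that
        level += take
        accepted.append(take)
    return accepted


# Hypothetical numbers, for illustration only.
history = simulate_journal(1000.0, 100.0, 3.0, 20)
for second, mb in enumerate(history, start=1):
    print(second, mb)
```

In this model the clients see full speed only until the journal is full; after that the accepted rate is pinned to whatever the backing store can drain, regardless of how fast the journal device is.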

> 	My first thought was that I was seeing a kind of
> "bufferbloat". The SSDs are 120 GB, so they could easily contain
> enough data to take a long time to dump.  I changed to using a
> journal file, limited to 1 GB, but I still see the same slow
> behavior.
>
> 	Any advice about how to go about debugging this would
> be appreciated.

It'd probably be useful to look at the write sizes going to disk. 
Increasing debugging levels in the Ceph logs will give you that, but it 
can be a lot to parse.  You can also use something like iostat or 
collectl to see what the per-second average write sizes are.

>
> 					Thanks,
> 					Bryan
>

Mark


* Re: Slow ceph fs performance
  2012-09-26 15:26 ` Mark Nelson
@ 2012-09-26 20:54   ` Bryan K. Wright
  2012-09-27 15:16     ` Bryan K. Wright
                       ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Bryan K. Wright @ 2012-09-26 20:54 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

Hi Mark,

	Thanks for your help.  Some answers to your questions
are below.

mark.nelson@inktank.com said:
> On 09/26/2012 09:50 AM, Bryan K. Wright wrote:
> Hi folks,

> Hi Bryan!

> 	I'm seeing reasonable performance when I run rados
> benchmarks, but really slow I/O when reading or writing
> from a mounted ceph filesystem.  The rados benchmarks
> show about 150 MB/s for both read and write, but when I
> go to a client machine with a mounted ceph filesystem
> and try to rsync a large (60 GB) directory tree onto
> the ceph fs, I'm getting rates of only 2-5 MB/s.

> Was the rados benchmark run from the same client machine that the filesystem
> is being mounted on?  Also, what object size did you use for rados bench?
> Does the directory tree have a lot of small files or a few very large ones?

	The rados benchmark was run on one of the OSD 
machines.  Read and write results looked like this (the
objects size was just the default, which seems to be 4kB):

# rados bench -p pbench 900 write
Total time run:         900.549729
Total writes made:      33819
Write size:             4194304
Bandwidth (MB/sec):     150.215 

Stddev Bandwidth:       16.2592
Max bandwidth (MB/sec): 212
Min bandwidth (MB/sec): 84
Average Latency:        0.426028
Stddev Latency:         0.24688
Max latency:            1.59936
Min latency:            0.06794

# rados bench -p pbench 900 seq
Total time run:        900.572788
Total reads made:     33676
Read size:            4194304
Bandwidth (MB/sec):    149.576 

Average Latency:       0.427844
Max latency:           1.48576
Min latency:           0.015371

	Regarding the rsync test, yes, the directory tree
was mostly small files.

> 	The OSDs and MDSs are all running 64-bit CentOS 6.3
> with the stock CentOS 2.6.32 kernel.  The client is also
> 64-bit CentOS 6.3, but it's running the "elrepo" 3.5.4 kernel.
> There are four OSDs, each with a hardware RAID 5 array
> and an SSD for the OSD journal.  The primary network
> is a gigabit network, and the OSD, MDS and MON
> machines have a dedicated backend gigabit network on a
> second network interface.
>
> 	Locally on the OSD, "hdparm -t -T" reports read rates
> of ~350 MB/s, and bonnie++ shows:
>
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> osd-local    23800M  1037  99 316048  92 131023  19  2272  98 312781  21 521.0  24
> Latency             13103us     183ms     123ms   15316us     100ms   75899us
> Version  1.96       ------Sequential Create------ --------Random Create--------
> osd-local           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>                files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                   16 16817  55 +++++ +++ 28786  77 23890  78 +++++ +++ 27128  75
> Latency             21549us     105us     134us     902us      12us     104us
>
> 	While rsyncing the files, the ceph logs show lots
> of warnings of the form:
>
> [WRN] : slow request 91.848407 seconds old, received at 2012-09-26 09:30:52.252449: osd_op(client.5310.1:56400 1000026eda0.00001ec8 [write 2093056~4096] 0.aa047db8 snapc 1=[]) currently waiting for sub ops
>
> 	Snooping on traffic with wireshark shows bursts of
> activity separated by long periods (30-60 sec) of idle time.

> My guess here is that if there is a lot of small IO happening, your SSD
> journal is handling it well and probably writing data really quickly,  while
> your spinning disk raid5 probably can't sustain anywhere near the  required
> IOPs to keep up.  So you get a burst of network traffic and the  journal
> writes it to the SSD quickly until it is filled up, then the OSD  stalls while
> it waits for the raid5 to write data out.  Whenever the  journal flushes, a
> new burst of traffic comes in and the process repeats.

	That sure sounds reasonable.  Maybe I can play some more
with the journal size and location to see how it affects the
speed and burstiness.

> 	My first thought was that I was seeing a kind of
> "bufferbloat". The SSDs are 120 GB, so they could easily contain
> enough data to take a long time to dump.  I changed to using a
> journal file, limited to 1 GB, but I still see the same slow
> behavior.
>
> 	Any advice about how to go about debugging this would
> be appreciated.

> It'd probably be useful to look at the write sizes going to disk.  Increasing
> debugging levels in the Ceph logs will give you that, but it  can be a lot to
> parse.  You can also use something like iostat or  collectl to see what the
> per-second average write sizes are.

	I'll see what I can find out.  Here's a quick output
from iostat (on one of the OSD hosts) while an rsync was running:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.23    0.00    0.20    0.21    0.00   99.36

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdm               0.96         5.82        19.94    4523588   15495690
sdn               9.96         1.51      1080.91    1174143  839900311
sdb               0.00         0.00         0.00       2248          0
sdc               0.00         0.00         0.00       2248          0
sde               0.00         0.00         0.00       2248          0
sda               0.00         0.00         0.00       2248          0
sdf               0.00         0.00         0.00       2248          0
sdi               0.00         0.00         0.00       2248          0
sdl               0.00         0.00         0.00       2248          0
sdg               0.00         0.00         0.00       2248          0
sdj               0.00         0.00         0.00       2248          0
sdh               0.00         0.00         0.00       2248          0
sdd               0.00         0.00         0.00       2248          0
sdk               0.00         0.00         0.00       2248          0
dm-0              0.00         0.00         0.00       2616          0
dm-1              2.14         5.81        19.80    4512994   15387832
sdo              96.83       305.85      3156.74  237658672 2452896474
dm-2              0.00         0.00         0.00        800         48

	The relevant lines are "sdo", which is the RAID array where
the object store lives, and "sdn", which is the journal SSD.
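To turn the report above into the average request sizes Mark asked about, one can divide the block rate by the transfer rate. The Python sketch below does that for the two relevant devices. It assumes the `Blk_*` columns count 512-byte blocks (as the sysstat documentation describes for Linux), and it lumps reads and writes together because `tps` does. If this was the first report printed by iostat, the figures are averages since boot rather than over the rsync run, so they are only a rough guide.

```python
SECTOR_BYTES = 512  # assumption: iostat "Blk" columns are 512-byte blocks

# (device, tps, Blk_read/s, Blk_wrtn/s) copied from the report above.
ROWS = [
    ("sdn", 9.96, 1.51, 1080.91),     # journal SSD
    ("sdo", 96.83, 305.85, 3156.74),  # RAID array holding the object store
]


def avg_request_kib(tps, blk_read_s, blk_wrtn_s):
    """Average size of one transfer in KiB, reads and writes combined."""
    if tps == 0:
        return 0.0
    return (blk_read_s + blk_wrtn_s) * SECTOR_BYTES / tps / 1024


sizes = {dev: avg_request_kib(tps, r, w) for dev, tps, r, w in ROWS}
for dev, kib in sizes.items():
    print(f"{dev}: about {kib:.1f} KiB per request")
```

For per-interval numbers, running iostat with an interval argument (for example `iostat 1`) and feeding each report through the same arithmetic would be more informative than a single cumulative report.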
	

> 					Thanks,
> 					Bryan

> Mark








* Re: Slow ceph fs performance
  2012-09-26 20:54   ` Bryan K. Wright
@ 2012-09-27 15:16     ` Bryan K. Wright
  2012-09-27 18:04     ` Gregory Farnum
  2012-09-27 23:40     ` Mark Kirkwood
  2 siblings, 0 replies; 23+ messages in thread
From: Bryan K. Wright @ 2012-09-27 15:16 UTC (permalink / raw)
  To: ceph-devel

Hi folks,

	I'm still struggling to get decent performance out of
cephfs.  I've played around with journal size and location,
but I/O rates to the mounted ceph filesystem always hover in
the range of 2-6 MB/sec while rsyncing a large directory tree
onto the ceph fs.  In contrast, using rsync over ssh to copy
the same tree on to the same RAID array on one of the OSDs gives
a rate of about 34 MB/sec.

	Here's a time/sequence plot from wireshark showing
what the traffic looks like from the client's perspective
while rsyncing onto the ceph fs:

http://ayesha.phys.virginia.edu/~bryan/time-sequence-ceph-2.png

As you can see, most of the time is spent in long
waits between bursts of packets.  Using a small journal file
instead of a whole SSD seems to slightly reduce the delays,
but not by much.  What other tunable parameters should I be 
trying?

	Looking at outgoing network rates on the client
with iptraf, I see the following while rsyncing over ssh:

	Rate: ~300Mb/s, ~8k packets/s --> ~40kb/packet

While rsyncing to the ceph fs, I see:

	Rate: ~50Mb/s, ~1k packets/s --> ~50kb/packet

(i.e., the average packet size is about the same, but
about eight times fewer packets are being sent per unit
time.)
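The per-packet figures follow directly from dividing the bit rate by the packet rate. A short Python check of that arithmetic, reading the iptraf numbers as megabits per second and treating the rounded rates quoted above as exact:

```python
def bits_per_packet(rate_mbit_s, packets_per_s):
    """Average packet size in bits for a given bit rate and packet rate."""
    return rate_mbit_s * 1e6 / packets_per_s


ssh_bits = bits_per_packet(300, 8000)   # rsync over ssh
ceph_bits = bits_per_packet(50, 1000)   # rsync onto the ceph fs

print(f"ssh:  {ssh_bits / 1000:.1f} kb/packet ({ssh_bits / 8:.0f} bytes)")
print(f"ceph: {ceph_bits / 1000:.1f} kb/packet ({ceph_bits / 8:.0f} bytes)")
print(f"packet rate ratio: {8000 / 1000:.0f}x")
```

The ssh case works out to 37.5 kb per packet, which is what the "~40kb" above rounds.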

	Looking at ops in flight on one of the OSDs,
using "ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok
dump_ops_in_flight", I see:

{ "num_ops": 3,
  "ops": [
        { "description": "pg_log(0.8 epoch 12 query_epoch 12)",
          "received_at": "2012-09-27 10:54:08.070493",
          "age": "66.673834",
          "flag_point": "delayed"},
        { "description": "pg_log(1.7 epoch 12 query_epoch 12)",
          "received_at": "2012-09-27 10:54:08.070715",
          "age": "66.673612",
          "flag_point": "delayed"},
        { "description": "pg_log(2.6 epoch 12 query_epoch 12)",
          "received_at": "2012-09-27 10:54:08.070750",
          "age": "66.673577",
          "flag_point": "delayed"}]}
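Since the admin socket output is JSON, it is easy to filter for long-lived operations when polling repeatedly. A minimal Python sketch, using the dump above as sample input; the 30-second threshold and the function name `stuck_ops` are arbitrary choices for illustration, and only the fields visible in the dump are assumed to exist.

```python
import json

# Sample output copied from the dump_ops_in_flight result above.
SAMPLE = """
{ "num_ops": 3,
  "ops": [
        { "description": "pg_log(0.8 epoch 12 query_epoch 12)",
          "received_at": "2012-09-27 10:54:08.070493",
          "age": "66.673834",
          "flag_point": "delayed"},
        { "description": "pg_log(1.7 epoch 12 query_epoch 12)",
          "received_at": "2012-09-27 10:54:08.070715",
          "age": "66.673612",
          "flag_point": "delayed"},
        { "description": "pg_log(2.6 epoch 12 query_epoch 12)",
          "received_at": "2012-09-27 10:54:08.070750",
          "age": "66.673577",
          "flag_point": "delayed"}]}
"""


def stuck_ops(dump_text, min_age_s=30.0):
    """Return (description, age, flag_point) for ops at least min_age_s old."""
    dump = json.loads(dump_text)
    result = []
    for op in dump["ops"]:
        age = float(op["age"])  # the dump stores age as a string
        if age >= min_age_s:
            result.append((op["description"], age, op["flag_point"]))
    return result


old_ops = stuck_ops(SAMPLE)
for description, age, flag in old_ops:
    print(f"{age:8.1f}s  {flag:10s}  {description}")
```

In this sample all three ops are `pg_log` messages flagged "delayed" for over a minute, rather than client writes.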

	Thanks for any advice.

					Bryan








* Re: Slow ceph fs performance
  2012-09-26 20:54   ` Bryan K. Wright
  2012-09-27 15:16     ` Bryan K. Wright
@ 2012-09-27 18:04     ` Gregory Farnum
  2012-09-27 18:47       ` Bryan K. Wright
  2012-10-01 16:47       ` Tommi Virtanen
  2012-09-27 23:40     ` Mark Kirkwood
  2 siblings, 2 replies; 23+ messages in thread
From: Gregory Farnum @ 2012-09-27 18:04 UTC (permalink / raw)
  To: bryan; +Cc: Mark Nelson, ceph-devel

On Wed, Sep 26, 2012 at 1:54 PM, Bryan K. Wright
<bkw1a@ayesha.phys.virginia.edu> wrote:
> Hi Mark,
>
>         Thanks for your help.  Some answers to your questions
> are below.
>
> mark.nelson@inktank.com said:
>> On 09/26/2012 09:50 AM, Bryan K. Wright wrote:
>> Hi folks,
>> Hi Bryan!
>> >
>>       I'm seeing reasonable performance when I run rados
>> benchmarks, but really slow I/O when reading or writing
>> from a mounted ceph filesystem.  The rados benchmarks
>> show about 150 MB/s for both read and write, but when I
>> go to a client machine with a mounted ceph filesystem
>> and try to rsync a large (60 GB) directory tree onto
>> the ceph fs, I'm getting rates of only 2-5 MB/s.
>> Was the rados benchmark run from the same client machine that the  filesystem
>> is being mounted on?  Also, what object size did you use for  rados bench?
>> Does the directory tree have a lot of small files or a few  very large ones?
>
>         The rados benchmark was run on one of the OSD
> machines.  Read and write results looked like this (the
> objects size was just the default, which seems to be 4kB):

Actually, that's 4MB. ;) Can you run
# rados bench -p pbench 900 write -t 256 -b 4096
and see what that gets? It'll run 256 simultaneous 4KB writes. (You
can also vary the number of simultaneous writes and see if that
impacts it.)

However, my suspicion is that you're limited by metadata throughput
here. How large are your files? There might be some MDS or client
tunables we can adjust, but rsync's workload is a known weak spot for
CephFS.
-Greg


* Re: Slow ceph fs performance
  2012-09-27 18:04     ` Gregory Farnum
@ 2012-09-27 18:47       ` Bryan K. Wright
  2012-09-27 19:47         ` Gregory Farnum
  2012-10-01 16:47       ` Tommi Virtanen
  1 sibling, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-09-27 18:47 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel


greg@inktank.com said:
> >
>         The rados benchmark was run on one of the OSD
> machines.  Read and write results looked like this (the
> objects size was just the default, which seems to be 4kB):
> Actually, that's 4MB. ;) 

	Oops! My plea is that I was the victim of a 
man page bug:

       bench seconds mode [ -b objsize ] [ -t threads ]
              Benchmark  for  seconds.  The  mode  can  be  write or read. The
              default object size is 4 KB, and the default number of simulated
              threads (parallel writes) is 16.


> Can you run # rados bench -p pbench 900 write -t 256
> -b 4096 and see what that gets? It'll run 256 simultaneous 4KB writes. (You
> can also vary the number of simultaneous writes and see if that impacts it.)

	Here's the new benchmark output:

 Total time run:         900.880070
Total writes made:      537187
Write size:             4096
Bandwidth (MB/sec):     2.329 

Stddev Bandwidth:       2.57691
Max bandwidth (MB/sec): 12.6055
Min bandwidth (MB/sec): 0
Average Latency:        0.429315
Stddev Latency:         0.891734
Max latency:            19.7647
Min latency:            0.016743
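The summary numbers from this run and from the earlier 4 MB run can be cross-checked with simple arithmetic: operations per second times operation size gives the bandwidth, and operations per second times average latency gives the implied number of requests in flight (Little's law). The Python sketch below uses only figures quoted in this thread. The assumption that the "MB/sec" figure is in units of 2^20 bytes is an inference from the fact that the numbers then match, not something the tool output states.

```python
MIB = 1024 * 1024


def summarize(total_ops, seconds, op_bytes, avg_latency_s):
    """Derive rate figures from a rados bench summary."""
    ops_per_s = total_ops / seconds
    return {
        "ops_per_s": ops_per_s,
        "mib_per_s": ops_per_s * op_bytes / MIB,
        # Little's law: concurrency = throughput * latency
        "implied_in_flight": ops_per_s * avg_latency_s,
    }


# 4 MB run (default concurrency) and 4 KB run (-t 256), from the thread.
big = summarize(33819, 900.549729, 4194304, 0.426028)
small = summarize(537187, 900.880070, 4096, 0.429315)

for name, result in (("4 MB", big), ("4 KB", small)):
    print(
        f"{name}: {result['ops_per_s']:.1f} ops/s, "
        f"{result['mib_per_s']:.3f} MiB/s, "
        f"about {result['implied_in_flight']:.0f} ops in flight"
    )
```

Both runs come out self-consistent: roughly 16 ops in flight for the first (the default thread count given in the man page excerpt) and 256 for the second. The 4 KB run completes only about 600 writes per second across the whole cluster, while the average latency is nearly the same in both runs.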
	

> However, my suspicion is that you're limited by metadata throughput here. How
> large are your files? There might be some MDS or client tunables we can
> adjust, but rsync's workload is a known weak spot for CephFS. -Greg 

	The file size is generally small.  Here's the distribution:

http://ayesha.phys.virginia.edu/~bryan/filesize.png

The mean is about 2.5 MB.

						Bryan




* Re: Slow ceph fs performance
  2012-09-27 18:47       ` Bryan K. Wright
@ 2012-09-27 19:47         ` Gregory Farnum
  0 siblings, 0 replies; 23+ messages in thread
From: Gregory Farnum @ 2012-09-27 19:47 UTC (permalink / raw)
  To: Bryan K. Wright; +Cc: ceph-devel

On Thu, Sep 27, 2012 at 11:47 AM, Bryan K. Wright
<bkw1a@ayesha.phys.virginia.edu> wrote:
>
> greg@inktank.com said:
>> >
>>         The rados benchmark was run on one of the OSD
>> machines.  Read and write results looked like this (the
>> objects size was just the default, which seems to be 4kB):
>> Actually, that's 4MB. ;)
>
>         Oops! My plea is that I was the victim of a
> man page bug:
>
>        bench seconds mode [ -b objsize ] [ -t threads ]
>               Benchmark  for  seconds.  The  mode  can  be  write or read. The
>               default object size is 4 KB, and the default number of simulated
>               threads (parallel writes) is 16.

Whoops! I'd fix it but it's obfuscated somewhat now, so:
http://tracker.newdream.net/issues/3230


>
>
>> Can you run # rados bench -p pbench 900 write -t 256
>> -b 4096 and see what that gets? It'll run 256 simultaneous 4KB writes. (You
>> can also vary the number of simultaneous writes and see if that impacts it.)
>
>         Here's the new benchmark output:
>
>  Total time run:         900.880070
> Total writes made:      537187
> Write size:             4096
> Bandwidth (MB/sec):     2.329
>
> Stddev Bandwidth:       2.57691
> Max bandwidth (MB/sec): 12.6055
> Min bandwidth (MB/sec): 0
> Average Latency:        0.429315
> Stddev Latency:         0.891734
> Max latency:            19.7647
> Min latency:            0.016743

Hmm, that is significantly lower than I would have expected. Can you
check and see if you can get that number higher by increasing (or
decreasing) the number of in-flight ops? (-t param)

Given your size distribution, it could just be that your RAID arrays
aren't giving you the small random write throughput you expect.


>> However, my suspicion is that you're limited by metadata throughput here. How
>> large are your files? There might be some MDS or client tunables we can
>> adjust, but rsync's workload is a known weak spot for CephFS. -Greg
>
>         The file size is generally small.  Here's the distribution:
>
> http://ayesha.phys.virginia.edu/~bryan/filesize.png
>
> The mean is about 2.5 MB.

So that chart is measuring in KB? Anyway, it might be metadata — you
could see what the CPU usage on the MDS server looks like while
running the rsync.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Slow ceph fs performance
  2012-09-27 18:04     ` Gregory Farnum
  2012-09-27 18:47       ` Bryan K. Wright
@ 2012-10-01 16:47       ` Tommi Virtanen
  2012-10-01 17:00         ` Gregory Farnum
  2012-10-01 17:03         ` Mark Nelson
  1 sibling, 2 replies; 23+ messages in thread
From: Tommi Virtanen @ 2012-10-01 16:47 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: bryan, Mark Nelson, ceph-devel

On Thu, Sep 27, 2012 at 11:04 AM, Gregory Farnum <greg@inktank.com> wrote:
> However, my suspicion is that you're limited by metadata throughput
> here. How large are your files? There might be some MDS or client
> tunables we can adjust, but rsync's workload is a known weak spot for
> CephFS.

I feel like people are missing this part of Greg's message. Everyone
is so busy benchmarking RADOS small I/O, but what if it's currently
bottlenecked by all the file-level access operations that interact
with the MDS? Rsync causes a ton of those.

If you want to benchmark just the small IO, you can't compare rsync to rsync.

If you want to benchmark just the metadata part, rsync with 0-size
files might actually be an interesting workload.
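One way to build such a workload is to generate a tree of empty files locally and use it as the rsync source. The Python sketch below is a hypothetical helper, not something from the thread: the directory and file naming scheme, the counts, and the function name `make_empty_tree` are all invented, and a realistic test would probably mirror the shape of the real directory tree instead.

```python
import tempfile
from pathlib import Path


def make_empty_tree(root, n_dirs, files_per_dir):
    """Create n_dirs subdirectories under root, each with empty files.

    Returns the number of files created.
    """
    root = Path(root)
    count = 0
    for d in range(n_dirs):
        subdir = root / f"dir{d:04d}"
        subdir.mkdir(parents=True, exist_ok=True)
        for f in range(files_per_dir):
            (subdir / f"file{f:05d}").touch()  # zero-length file
            count += 1
    return count


if __name__ == "__main__":
    # Demonstration in a throwaway directory; point root at a real
    # location to produce an rsync source tree.
    with tempfile.TemporaryDirectory() as tmp:
        made = make_empty_tree(tmp, n_dirs=10, files_per_dir=100)
        total = sum(p.stat().st_size for p in Path(tmp).rglob("*") if p.is_file())
        print(made, "files created,", total, "bytes of file data")
```

Timing an rsync of such a tree onto the ceph fs, and comparing files per second against the original run, would indicate how much of the time goes to per-file operations rather than data.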


* Re: Slow ceph fs performance
  2012-10-01 16:47       ` Tommi Virtanen
@ 2012-10-01 17:00         ` Gregory Farnum
  2012-10-03 14:55           ` Bryan K. Wright
  2012-10-01 17:03         ` Mark Nelson
  1 sibling, 1 reply; 23+ messages in thread
From: Gregory Farnum @ 2012-10-01 17:00 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: bryan, Mark Nelson, ceph-devel

On Mon, Oct 1, 2012 at 9:47 AM, Tommi Virtanen <tv@inktank.com> wrote:
> On Thu, Sep 27, 2012 at 11:04 AM, Gregory Farnum <greg@inktank.com> wrote:
>> However, my suspicion is that you're limited by metadata throughput
>> here. How large are your files? There might be some MDS or client
>> tunables we can adjust, but rsync's workload is a known weak spot for
>> CephFS.
>
> I feel like people are missing this part of Greg's message. Everyone
> is so busy benchmarking RADOS small I/O, but what if it's currently
> bottlenecked by all the file-level access operations that interact
> with the MDS? Rsync causes a ton of those.

Yes. Bryan, you mentioned that you didn't see a lot of resource usage
— was it perhaps flatlined at (100 * 1 / num_cpus)? The MDS is
multi-threaded in theory, but in practice it has the equivalent of a
Big Kernel Lock so it's not going to get much past one cpu core of
time...
The rados bench results do indicate some pretty bad small-file write
performance as well though, so I guess it's possible your testing is
running long enough that the page cache isn't absorbing that hit. Did
performance start out higher or has it been flat?

> If you want to benchmark just the small IO, you can't compare rsync to rsync.
>
> If you want to benchmark just the metadata part, rsync with 0-size
> files might actually be an interesting workload.


* Re: Slow ceph fs performance
  2012-10-01 17:00         ` Gregory Farnum
@ 2012-10-03 14:55           ` Bryan K. Wright
  2012-10-03 18:35             ` Gregory Farnum
  0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-03 14:55 UTC (permalink / raw)
  To: ceph-devel

Hi again,

	A few answers to questions from various people on the list
after my last e-mail:

greg@inktank.com said:
> Yes. Bryan, you mentioned that you didn't see a lot of resource usage — was it
> perhaps flatlined at (100 * 1 / num_cpus)? The MDS is multi-threaded in
> theory, but in practice it has the equivalent of a Big Kernel Lock so it's not
> going to get much past one cpu core of time... 

	The CPU usage on the MDSs hovered around a few percent.
They're quad-core machines, and I didn't see it ever get as high
as 25% usage on any of the cores while watching with atop.

greg@inktank.com said:
> The rados bench results do indicate some pretty bad small-file write
> performance as well though, so I guess it's possible your testing is running
> long enough that the page cache isn't absorbing that hit. Did performance
> start out higher or has it been flat? 

	Looking at the details of the rados benchmark output, it does 
look like performance starts out better for the first few iterations,
and then goes bad.  Here's the beginning of a typical small-file run:

 Maintaining 256 concurrent writes of 4096 bytes for at least 900 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1     255      3683      3428   13.3894   13.3906  0.002569 0.0696906
     2     256      7561      7305   14.2661   15.1445  0.106437 0.0669534
     3     256     10408     10152   13.2173   11.1211  0.002176 0.0689543
     4     256     11256     11000    10.741    3.3125  0.002097 0.0846414
     5     256     11256     11000    8.5928         0         - 0.0846414
     6     256     11370     11114   7.23489  0.222656  0.002399 0.0962989
     7     255     12480     12225   6.82126   4.33984  0.117658  0.142335
     8     256     13289     13033   6.36311   3.15625  0.002574  0.151261
     9     256     13737     13481   5.85051      1.75  0.120657  0.158865
    10     256     14341     14085   5.50138   2.35938  0.022544  0.178298

I see the same behavior every time I repeat the small-file 
rados benchmark.  Here's a graph showing the first 100 "cur MB/s" values
for a short-file benchmark:

http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4096-run1-09282012-curmbps.pdf
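
As a cross-check on the table above, the "cur MB/s" column can be reproduced from the change in "finished" between rows. This is a rough sketch, assuming each row covers exactly one second and that rados bench counts 1 MB as 2^20 bytes; the helper name is hypothetical:

```python
BLOCK_BYTES = 4096
MIB = 1024 * 1024

# (sec, finished) pairs copied from the first rows of the small-file run.
finished = [(0, 0), (1, 3428), (2, 7305), (3, 10152), (4, 11000), (5, 11000)]

def cur_mb_per_s(rows, block_bytes=BLOCK_BYTES):
    """Throughput per one-second interval, from the change in 'finished'."""
    rates = []
    for (_, prev), (sec, done) in zip(rows, rows[1:]):
        rates.append((sec, (done - prev) * block_bytes / MIB))
    return rates

for sec, rate in cur_mb_per_s(finished):
    print(f"{sec:3d} {rate:8.4f}")
```

The stall at second 5 shows up as an interval in which no writes finished at all.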

	On the other hand, with 4MB files, I see results that start out like 
this:

 Maintaining 256 concurrent writes of 4194304 bytes for at least 900 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      49        49         0         0         0         -         0
     2      76        76         0         0         0         -         0
     3     105       105         0         0         0         -         0
     4     133       133         0         0         0         -         0
     5     159       159         0         0         0         -         0
     6     188       188         0         0         0         -         0
     7     218       218         0         0         0         -         0
     8     246       246         0         0         0         -         0
     9     256       274        18   7.99904         8   8.97759   8.66218
    10     255       301        46   18.3978       112    9.1456   8.94095
    11     255       330        75   27.2695       116   9.06968     9.013
    12     255       358       103   34.3292       112   9.12486   9.04374

Here's a graph showing the first 100 "cur MB/s" values for a typical
4MB file benchmark:

http://ayesha.phys.virginia.edu/~bryan/rados-bench-t256-b4MB-run1-09282012-curmbps.pdf
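
One way to read the latency and throughput columns together is Little's law (operations in flight = completion rate x latency). The sketch below is illustrative only: it assumes the benchmark is in steady state, that "avg lat" (a running average) is close to the current latency, and that 1 MB means 2^20 bytes.

```python
MIB = 1024 * 1024

def littles_law_mb_per_s(concurrency, avg_latency_s, block_bytes):
    """Steady-state throughput implied by Little's law: ops/s = in-flight / latency."""
    ops_per_s = concurrency / avg_latency_s
    return ops_per_s * block_bytes / MIB

# Row 12 of the 4 MB run: 255 ops in flight, avg lat 9.04374 s.
print(round(littles_law_mb_per_s(255, 9.04374, 4194304), 1))
# Row 10 of the 4096-byte run: 256 ops in flight, avg lat 0.178298 s.
print(round(littles_law_mb_per_s(256, 0.178298, 4096), 2))
```

Under those assumptions the 9-second latencies in the 4 MB run are roughly what 255 queued 4 MB writes would see at the ~112 MB/s shown in the "cur MB/s" column, so the latency there looks like queueing rather than slow individual writes.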

mark.nelson@inktank.com said:
> When you were doing this, what kind of results did collectl give you for
> average write sizes to the underlying OSD disks? 

	The average "rwsize" reported by collectl hovered around 
6 +/- a few (in whatever units collectl reports) for the RAID
array, and around 15 for the journal SSD, while doing the small-file
rados benchmark.  Here's a screenshot showing atop running on
each of the MDS hosts, and collectl running on each of the OSD
hosts, while the benchmark was running:

http://ayesha.phys.virginia.edu/~bryan/collectl-atop-t256-b4096.png

Here's the same, but with collectl running on the MDSs instead of atop:

http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4096.png

Looking at the last screenshot again, it does look like the disks on
the MDSs are getting some exercise, with ~40% utilization (if I'm
interpreting the collectl output correctly).

Here's a similar snapshot for the 4MB test:

http://ayesha.phys.virginia.edu/~bryan/collectl-collectl-t256-b4MB.png

It looks like similar "pct util" on the MDS disks, but much higher
average rwsize values on the OSDs.

mark.nelson@inktank.com said:
> There's multiple issues potentially here.  Part of it might be how  writes are
> coalesced by XFS in each scenario.  Part of it might also be  overhead due to
> XFS metadata reads/writes.  You could probably get a  better idea of both of
> these by running blktrace during the tests and  making seekwatcher movies of
> the results.  You not only can look at the  numbers of seeks, but also the
> kind (read/writes) and where on the disk  they are going.  That, and some of
> the raw blktrace data can give you a  lot of information about what is going
> on and whether or not seeks are  

	I'll take a look at blktrace and see what I can find out.

mark.nelson@inktank.com said:
> Beyond that, I do think you are correct in suspecting that there are  some
> Ceph limitations as well.  Some things that may be interesting to try:

> - 1 OSD per Disk
> - Multiple OSDs on the RAID array.
> - Increasing various thread counts
> - Increasing various op and byte limits (such as
>   journal_max_write_entries and journal_max_write_bytes).
> - EXT4 or BTRFS under the OSDs.

	And I'll give some of these a try.

	Regarding the iozone benchmarks:
mark.nelson@inktank.com said:
> Do you happen to have the settings you used when you ran these tests?  I
> probably don't have time to try to repeat them now, but I can at least  take a
> quick look at them. 
> I'm slightly confused by the labels on the graph.  They can't possibly  mean
> that 2^16384 KB record sizes were tested.  Was that just up to 16MB  records
> and 16GB files?  That would make a lot more sense. 

I just did something like:

	cd /mnt/tmp (where the cephfs was mounted)
	iozone -a > /tmp/iozone.log

By default, iozone does its tests in the current working directory.
The graphs were just produced with the Generate_Graphs script
that comes with iozone.  There are certainly some problems with
the axis labeling, but I think your interpretation is correct.

mark.nelson@inktank.com said:
> This might be a dumb question, but was the ceph version of this test on  a
> single client on gigabit Ethernet?  If so, wouldn't that be the reason  you
> are maxing out at like 114MB/s? 

	Duh.  You're exactly right.  I should have noticed this.
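
For the record, the ceiling can be estimated from the framing overhead. This is a back-of-the-envelope sketch, assuming a standard 1500-byte MTU and TCP over IPv4 with no header options; real links land a little lower:

```python
LINK_BITS_PER_S = 1_000_000_000  # gigabit Ethernet line rate

# Per-frame byte counts for 1500-byte MTU Ethernet carrying TCP over IPv4.
MTU = 1500
IP_TCP_HEADERS = 20 + 20          # assumes no IP or TCP options
WIRE_OVERHEAD = 14 + 4 + 8 + 12   # Ethernet header, FCS, preamble, inter-frame gap

def tcp_payload_ceiling_bytes_per_s():
    """Best-case TCP payload rate on a saturated gigabit link."""
    payload = MTU - IP_TCP_HEADERS
    on_wire = MTU + WIRE_OVERHEAD
    return LINK_BITS_PER_S / 8 * payload / on_wire

ceiling = tcp_payload_ceiling_bytes_per_s()
print(f"{ceiling / 1e6:.1f} MB/s (decimal), {ceiling / 2**20:.1f} MiB/s")
```

A single client topping out around 114MB/s is therefore about what one gigabit interface can deliver.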

	And finally:
tv@inktank.com said:
> If you want to benchmark just the metadata part, rsync with 0-size files might
> actually be an interesting workload. 

	I'll see if I can work out a way to do this.
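
One possible way to set this up (a sketch only; the function name and any paths passed to it are hypothetical) is to build a shadow copy of the source tree on local disk containing the same directory and file names but zero-length files, and then rsync that shadow tree onto the ceph mount:

```python
import os

def mirror_as_empty_files(src_root: str, dst_root: str) -> int:
    """Recreate src_root's directory tree under dst_root with zero-length files.

    Returns the number of empty files created.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            # Opening in "w" mode creates (or truncates to) an empty file.
            with open(os.path.join(target_dir, name), "w"):
                pass
            count += 1
    return count
```

Since no file data is transferred, the rsync time for the shadow tree should be dominated by the create/stat/rename traffic that goes through the MDS.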

			Thanks to everyone for the suggestions.
			Bryan
-- 
========================================================================
Bryan Wright              |"If you take cranberries and stew them like 
Physics Department        | applesauce, they taste much more like prunes
University of Virginia    | than rhubarb does."  --  Groucho 
Charlottesville, VA  22901|			
(434) 924-7218            |         bryan@virginia.edu
========================================================================

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Slow ceph fs performance
  2012-10-03 14:55           ` Bryan K. Wright
@ 2012-10-03 18:35             ` Gregory Farnum
  2012-10-04 13:14               ` Bryan K. Wright
  0 siblings, 1 reply; 23+ messages in thread
From: Gregory Farnum @ 2012-10-03 18:35 UTC (permalink / raw)
  To: bryan; +Cc: ceph-devel

I think I'm with Mark now — this does indeed look like too much random
IO for the disks to handle. In particular, Ceph requires that each
write be synced to disk before it's considered complete, which rsync
definitely doesn't. In the filesystem this is generally disguised
fairly well by all the caches and such in the way, but this use case
is unfriendly to that arrangement.
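
To get a feel for the cost of syncing every write, here is a small local analogy. It is not how the OSD or its journal is implemented; it only contrasts buffered writes with an fsync after each write on whatever filesystem holds the temporary directory. All names are illustrative.

```python
import os
import tempfile
import time

def timed_writes(path, count, size, sync_each):
    """Write `count` blocks of `size` bytes; optionally fsync after each one."""
    block = b"\0" * size
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
            if sync_each:
                f.flush()
                os.fsync(f.fileno())
    return time.perf_counter() - start

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        buffered = timed_writes(os.path.join(d, "buffered"), 1000, 4096, False)
        synced = timed_writes(os.path.join(d, "synced"), 1000, 4096, True)
        print(f"buffered: {buffered:.3f}s  fsync per write: {synced:.3f}s")
```

On spinning disks the synced run is typically much slower, which is the effect the page cache normally hides from tools like rsync.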

However, I am particularly struck by seeing one of your OSDs at 96%
disk utilization while the others remain <50%, and I've just realized
we never saw output from ceph -s. Can you provide that, please?
-Greg


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Slow ceph fs performance
  2012-10-03 18:35             ` Gregory Farnum
@ 2012-10-04 13:14               ` Bryan K. Wright
  2012-10-04 15:24                 ` Sage Weil
  0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-04 13:14 UTC (permalink / raw)
  To: ceph-devel

Hi Greg,

greg@inktank.com said:
> I think I'm with Mark now — this does indeed look like too much random IO for
> the disks to handle. In particular, Ceph requires that each write be synced to
> disk before it's considered complete, which rsync definitely doesn't. In the
> filesystem this is generally disguised fairly well by all the caches and such
> in the way, but this use case is unfriendly to that arrangement.

> However, I am particularly struck by seeing one of your OSDs at 96% disk
> utilization while the others remain <50%, and I've just realized we never saw
> output from ceph -s. Can you provide that, please? 

	Here's the ceph -s output:

   health HEALTH_OK
   monmap e1: 3 mons at {0=192.168.1.31:6789/0,1=192.168.1.32:6789/0,2=192.168.1.33:6789/0}, election epoch 2, quorum 0,1,2 0,1,2
   osdmap e24: 4 osds: 4 up, 4 in
    pgmap v8363: 960 pgs: 960 active+clean; 15099 MB data, 38095 MB used, 74354 GB / 74391 GB avail
   mdsmap e25: 1/1/1 up {0=2=up:active}, 2 up:standby

	The OSD disk utilization seems to vary a lot during these
benchmarks.  My recollection is that each of the OSD hosts sometimes
sees near-100% utilization.

						Bryan



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Slow ceph fs performance
  2012-10-04 13:14               ` Bryan K. Wright
@ 2012-10-04 15:24                 ` Sage Weil
  2012-10-04 15:54                   ` Bryan K. Wright
  0 siblings, 1 reply; 23+ messages in thread
From: Sage Weil @ 2012-10-04 15:24 UTC (permalink / raw)
  To: bryan; +Cc: ceph-devel

On Thu, 4 Oct 2012, Bryan K. Wright wrote:
> Hi Greg,
> 
> greg@inktank.com said:
> > I think I'm with Mark now - this does indeed look like too much random IO for
> > the disks to handle. In particular, Ceph requires that each write be synced to
> > disk before it's considered complete, which rsync definitely doesn't. In the
> > filesystem this is generally disguised fairly well by all the caches and such
> > in the way, but this use case is unfriendly to that arrangement.
> 
> > However, I am particularly struck by seeing one of your OSDs at 96% disk
> > utilization while the others remain <50%, and I've just realized we never saw
> > output from ceph -s. Can you provide that, please? 
> 
> 	Here's the ceph -s output:
> 
>    health HEALTH_OK
>    monmap e1: 3 mons at {0=192.168.1.31:6789/0,1=192.168.1.32:6789/0,2=192.168.1.33:6789/0}, election epoch 2, quorum 0,1,2 0,1,2
>    osdmap e24: 4 osds: 4 up, 4 in
>     pgmap v8363: 960 pgs: 960 active+clean; 15099 MB data, 38095 MB used, 74354 GB / 74391 GB avail
>    mdsmap e25: 1/1/1 up {0=2=up:active}, 2 up:standby
> 
> 	The OSD disk utilization seems to vary a lot during these
> benchmarks.  My recollection is that each of the OSD hosts sometimes
> sees near-100% utilization.

Can you also include 'ceph osd tree', 'ceph osd dump', and 'ceph pg dump' 
output?  So we can make sure CRUSH is distributing things well?

Thanks!
sage



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Slow ceph fs performance
  2012-10-04 15:24                 ` Sage Weil
@ 2012-10-04 15:54                   ` Bryan K. Wright
  2012-10-26 20:48                     ` Gregory Farnum
  0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-04 15:54 UTC (permalink / raw)
  To: ceph-devel

Hi Sage,

sage@inktank.com said:
> Can you also include 'ceph osd tree', 'ceph osd dump', and 'ceph pg dump'
> output?  So we can make sure CRUSH is distributing things well? 

Here they are:

# ceph osd tree
dumped osdmap tree epoch 24
# id    weight  type name       up/down reweight
-1      4       pool default
-3      4               rack unknownrack
-2      1                       host ceph-osd-1
1       1                               osd.1   up      1
-4      1                       host ceph-osd-2
2       1                               osd.2   up      1
-5      1                       host ceph-osd-3
3       1                               osd.3   up      1
-6      1                       host ceph-osd-4
4       1                               osd.4   up      1

# ceph osd dump
dumped osdmap epoch 24
epoch 24
fsid 7e4e4302-4ced-439e-9786-49e6036dfda4
created 2012-09-28 13:17:40.774580
modifed 2012-09-28 16:56:02.864965
flags 

pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0

max_osd 5
osd.1 up   in  weight 1 up_from 18 up_thru 21 down_at 17 last_clean_interval [10,15) 192.168.1.21:6800/3702 192.168.12.21:6800/3702 192.168.12.21:6801/3702 exists,up 4ad0b4cd-cbff-4693-b8f7-667148386cf3
osd.2 up   in  weight 1 up_from 17 up_thru 21 down_at 16 last_clean_interval [8,15) 192.168.1.22:6800/3428 192.168.12.22:6800/3428 192.168.12.22:6801/3428 exists,up 6a829cc6-fc60-450a-ac1d-8e148b757e57
osd.3 up   in  weight 1 up_from 21 up_thru 21 down_at 20 last_clean_interval [9,15) 192.168.1.23:6800/3436 192.168.12.23:6800/3436 192.168.12.23:6801/3436 exists,up 387cff7a-b857-434b-af66-0e08f56fd0f7
osd.4 up   in  weight 1 up_from 18 up_thru 21 down_at 17 last_clean_interval [9,15) 192.168.1.24:6800/3486 192.168.12.24:6800/3486 192.168.12.24:6801/3486 exists,up fe8c4bf0-ff6f-41e9-91ac-d5826672f8b5

# ceph pg dump
See http://ayesha.phys.virginia.edu/~bryan/ceph-pg-dump.txt
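
As a quick sanity check on the pool lines above, the sketch below parses them and totals the placement groups. It assumes equal OSD weights (as in the osd tree output) and an ideal even spread; the regular expression is written against the exact line format shown here and may not fit other Ceph versions.

```python
import re

OSD_DUMP_POOLS = """\
pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0
"""

POOL_RE = re.compile(
    r"^pool \d+ '(?P<name>[^']+)' rep size (?P<size>\d+) .*? pg_num (?P<pg_num>\d+)\b"
)

def summarize(dump_text, num_osds):
    """Return (total PGs, PG copies per OSD if spread perfectly evenly)."""
    pools = [m.groupdict() for m in map(POOL_RE.match, dump_text.splitlines()) if m]
    total_pgs = sum(int(p["pg_num"]) for p in pools)
    total_copies = sum(int(p["pg_num"]) * int(p["size"]) for p in pools)
    return total_pgs, total_copies / num_osds

total_pgs, copies_per_osd = summarize(OSD_DUMP_POOLS, num_osds=4)
print(total_pgs, copies_per_osd)
```

The total agrees with the "960 pgs" reported by ceph -s; the per-OSD figure is only the ideal to compare the pg dump against.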

					Bryan



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Slow ceph fs performance
  2012-10-04 15:54                   ` Bryan K. Wright
@ 2012-10-26 20:48                     ` Gregory Farnum
  2012-10-29 15:08                       ` Bryan K. Wright
  0 siblings, 1 reply; 23+ messages in thread
From: Gregory Farnum @ 2012-10-26 20:48 UTC (permalink / raw)
  To: bryan; +Cc: ceph-devel

On Thu, Oct 4, 2012 at 8:54 AM, Bryan K. Wright
<bkw1a@ayesha.phys.virginia.edu> wrote:
> Hi Sage,
>
> sage@inktank.com said:
>> Can you also include 'ceph osd tree', 'ceph osd dump', and 'ceph pg dump'
>> output?  So we can make sure CRUSH is distributing things well?
>
> Here they are:
>
> # ceph osd tree
> dumped osdmap tree epoch 24
> # id    weight  type name       up/down reweight
> -1      4       pool default
> -3      4               rack unknownrack
> -2      1                       host ceph-osd-1
> 1       1                               osd.1   up      1
> -4      1                       host ceph-osd-2
> 2       1                               osd.2   up      1
> -5      1                       host ceph-osd-3
> 3       1                               osd.3   up      1
> -6      1                       host ceph-osd-4
> 4       1                               osd.4   up      1
>
> # ceph osd dump
> dumped osdmap epoch 24
> epoch 24
> fsid 7e4e4302-4ced-439e-9786-49e6036dfda4
> created 2012-09-28 13:17:40.774580
> modifed 2012-09-28 16:56:02.864965
> flags
>
> pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0 crash_replay_interval 45
> pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0
> pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 320 pgp_num 320 last_change 1 owner 0
>
> max_osd 5
> osd.1 up   in  weight 1 up_from 18 up_thru 21 down_at 17 last_clean_interval [10,15) 192.168.1.21:6800/3702 192.168.12.21:6800/3702 192.168.12.21:6801/3702 exists,up 4ad0b4cd-cbff-4693-b8f7-667148386cf3
> osd.2 up   in  weight 1 up_from 17 up_thru 21 down_at 16 last_clean_interval [8,15) 192.168.1.22:6800/3428 192.168.12.22:6800/3428 192.168.12.22:6801/3428 exists,up 6a829cc6-fc60-450a-ac1d-8e148b757e57
> osd.3 up   in  weight 1 up_from 21 up_thru 21 down_at 20 last_clean_interval [9,15) 192.168.1.23:6800/3436 192.168.12.23:6800/3436 192.168.12.23:6801/3436 exists,up 387cff7a-b857-434b-af66-0e08f56fd0f7
> osd.4 up   in  weight 1 up_from 18 up_thru 21 down_at 17 last_clean_interval [9,15) 192.168.1.24:6800/3486 192.168.12.24:6800/3486 192.168.12.24:6801/3486 exists,up fe8c4bf0-ff6f-41e9-91ac-d5826672f8b5
>
> # ceph pg dump
> See http://ayesha.phys.virginia.edu/~bryan/ceph-pg-dump.txt

Eeek, I was going through my email backlog and came across this thread
again. Everything here does look good; the data distribution etc is
pretty reasonable.
If you're still testing, we can at least get a rough idea of the sorts
of IO the OSD is doing by looking at the perfcounters out of the admin
socket:
ceph --admin-daemon /path/to/socket perf dump
(I believe the default path is /var/run/ceph/ceph-osd.*.asok)
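
If it helps, the command can be scripted across all OSD sockets on a host. This is a minimal sketch: the helper names are invented, the socket pattern is the default path guessed above, and it assumes perf dump prints JSON on stdout.

```python
import glob
import json
import subprocess

def admin_socket_command(socket_path):
    """Build the argv for the perf dump command quoted above."""
    return ["ceph", "--admin-daemon", socket_path, "perf", "dump"]

def perf_dump(socket_path):
    """Run perf dump against one admin socket and parse its JSON output."""
    out = subprocess.run(admin_socket_command(socket_path),
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

def dump_all(pattern="/var/run/ceph/ceph-osd.*.asok"):
    """Collect perf counters from every OSD socket matching the pattern."""
    return {path: perf_dump(path) for path in sorted(glob.glob(pattern))}
```

Calling dump_all() on an OSD host returns one dictionary of counters per socket; on a host with no matching sockets it returns an empty dictionary.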

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Slow ceph fs performance
  2012-10-26 20:48                     ` Gregory Farnum
@ 2012-10-29 15:08                       ` Bryan K. Wright
  2012-11-03 17:55                         ` Gregory Farnum
  0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-29 15:08 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: bryan, ceph-devel


greg@inktank.com said:
> Eeek, I was going through my email backlog and came across this thread again.
> Everything here does look good; the data distribution etc is pretty
> reasonable. If you're still testing, we can at least get a rough idea of the
> sorts of IO the OSD is doing by looking at the perfcounters out of the admin
> socket: ceph --admin-daemon /path/to/socket perf dump (I believe the default
> path is /var/run/ceph/ceph-osd.*.asok)

Hi Greg,

	Thanks for your help.  I've been experimenting with other things,
so the cluster has a different arrangement now, but the performance
seems to be about the same.  I've now broken down the RAID arrays into
JBOD disks, and I'm running one OSD per disk, recklessly ignoring
the warning about syncfs being missing.  (Performance doesn't seem
any better or worse than it was before when rsyncing a large directory
of small files.)  I've also added another osd node into the mix, with
a different disk controller.

	For what it's worth, here are "perf dump" outputs for a
couple of OSDs running on the old and new hardware, respectively:

http://ayesha.phys.virginia.edu/~bryan/perf.osd.200.txt
http://ayesha.phys.virginia.edu/~bryan/perf.osd.100.txt

If you could take a look at them and let me know if you see
anything enlightening, I'd really appreciate it.

					Thanks,
					Bryan




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Slow ceph fs performance
  2012-10-29 15:08                       ` Bryan K. Wright
@ 2012-11-03 17:55                         ` Gregory Farnum
  0 siblings, 0 replies; 23+ messages in thread
From: Gregory Farnum @ 2012-11-03 17:55 UTC (permalink / raw)
  To: Bryan K. Wright, Samuel Just; +Cc: bryan, ceph-devel@vger.kernel.org

On Mon, Oct 29, 2012 at 4:08 PM, Bryan K. Wright
<bkw1a@ayesha.phys.virginia.edu> wrote:
>
> greg@inktank.com said:
>> Eeek, I was going through my email backlog and came across this thread again.
>> Everything here does look good; the data distribution etc is pretty
>> reasonable. If you're still testing, we can at least get a rough idea of the
>> sorts of IO the OSD is doing by looking at the perfcounters out of the admin
>> socket: ceph --admin-daemon /path/to/socket perf dump (I believe the default
>> path is /var/run/ceph/ceph-osd.*.asok)
>
> Hi Greg,
>
>         Thanks for your help.  I've been experimenting with other things,
> so the cluster has a different arrangement now, but the performance
> seems to be about the same.  I've now broken down the RAID arrays into
> JBOD disks, and I'm running one OSD per disk, recklessly ignoring
> the warning about syncfs being missing.  (Performance doesn't seem
> any better or worse than it was before when rsyncing a large directory
> of small files.)  I've also added another osd node into the mix, with
> a different disk controller.
>
>         For what it's worth, here are "perf dump" outputs for a
> couple of OSDs running on the old and new hardware, respectively:
>
> http://ayesha.phys.virginia.edu/~bryan/perf.osd.200.txt
> http://ayesha.phys.virginia.edu/~bryan/perf.osd.100.txt
>
> If you could take a look at them and let me know if you see
> anything enlightening, I'd really appreciate it.

Sam, can you check these out? I notice in particular that the average
"apply_latency" is 1.44 seconds — but I don't know if I have the units
right on that or have parsed something else wrong.
-Greg
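
For anyone repeating the calculation: the sketch below shows how such an average would be derived, assuming the latency counter is stored as a pair of "avgcount" (number of samples) and "sum" (total time), and assuming "sum" is in seconds, which is exactly the unit question raised above. The numbers are made up for illustration and are not taken from Bryan's dumps.

```python
def average_seconds(counter):
    """Mean of a perf counter stored as {"avgcount": n, "sum": total}."""
    if counter["avgcount"] == 0:
        return 0.0
    return counter["sum"] / counter["avgcount"]

# Made-up numbers for illustration only; not taken from the linked files.
example = {"avgcount": 1000, "sum": 1440.0}
print(average_seconds(example))
```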

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Slow ceph fs performance
  2012-10-01 16:47       ` Tommi Virtanen
  2012-10-01 17:00         ` Gregory Farnum
@ 2012-10-01 17:03         ` Mark Nelson
  1 sibling, 0 replies; 23+ messages in thread
From: Mark Nelson @ 2012-10-01 17:03 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Gregory Farnum, bryan, ceph-devel

On 10/01/2012 11:47 AM, Tommi Virtanen wrote:
> On Thu, Sep 27, 2012 at 11:04 AM, Gregory Farnum<greg@inktank.com>  wrote:
>> However, my suspicion is that you're limited by metadata throughput
>> here. How large are your files? There might be some MDS or client
>> tunables we can adjust, but rsync's workload is a known weak spot for
>> CephFS.
>
> I feel like people are missing this part of Greg's message. Everyone
> is so busy benchmarking RADOS small I/O, but what if it's currently
> bottlenecked by all the file-level access operations that interact
> with the MDS? Rsync causes a ton of those.
>
> If you want to benchmark just the small IO, you can't compare rsync to rsync.
>
> If you want to benchmark just the metadata part, rsync with 0-size
> files might actually be an interesting workload.

I guess most of the small IO testing we've seen/done has been without 
CephFS at all.  It's entirely possible that the MDS is slowing things 
down with an rsync workload like this on a fresh filesystem though. 
Having said that, I don't like the way that our small IO performance 
behaves (especially over time) when doing something like RADOS Bench. 
It definitely seems like there is some pretty nasty underlying 
filesystem metadata fragmentation or something going on after a while.

Mark


* Re: Slow ceph fs performance
  2012-09-26 20:54   ` Bryan K. Wright
  2012-09-27 15:16     ` Bryan K. Wright
  2012-09-27 18:04     ` Gregory Farnum
@ 2012-09-27 23:40     ` Mark Kirkwood
  2012-09-27 23:49       ` Mark Kirkwood
  2 siblings, 1 reply; 23+ messages in thread
From: Mark Kirkwood @ 2012-09-27 23:40 UTC (permalink / raw)
  To: bryan; +Cc: Bryan K. Wright, Mark Nelson, ceph-devel

Bryan -

Note that the default block size for the rados bench is 4MB...and 
performance decreases quite dramatically with smaller block sizes (-b 
option to rados bench).
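
The "Write size" line in the output quoted below already shows this:
4194304 bytes is 4 MiB, and the reported bandwidth follows from the
totals. A minimal sketch of that check, assuming rados bench's "MB"
means 2^20 bytes (the function name is only illustrative):

```python
MIB = 1024 * 1024


def bandwidth_mib_per_s(ops, op_bytes, seconds):
    """Aggregate bandwidth in MiB/s from an operation count, size and run time."""
    return ops * op_bytes / seconds / MIB


# Figures copied from the quoted write and seq runs.
write_bw = bandwidth_mib_per_s(33819, 4194304, 900.549729)
read_bw = bandwidth_mib_per_s(33676, 4194304, 900.572788)

print(f"object size: {4194304 // MIB} MiB")
print(f"write: {write_bw:.3f} MiB/s, read: {read_bw:.3f} MiB/s")
```

Both results agree with the "Bandwidth (MB/sec)" lines in the quoted
output, so the runs were using 4 MiB objects, not 4 kB ones.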

On 27/09/12 08:54, Bryan K. Wright wrote:
>
> 	The rados benchmark was run on one of the OSD
> machines.  Read and write results looked like this (the
> objects size was just the default, which seems to be 4kB):
>
> # rados bench -p pbench 900 write
> Total time run:         900.549729
> Total writes made:      33819
> Write size:             4194304
> Bandwidth (MB/sec):     150.215
>
> Stddev Bandwidth:       16.2592
> Max bandwidth (MB/sec): 212
> Min bandwidth (MB/sec): 84
> Average Latency:        0.426028
> Stddev Latency:         0.24688
> Max latency:            1.59936
> Min latency:            0.06794
>
> # rados bench -p pbench 900 seq
> Total time run:        900.572788
> Total reads made:     33676
> Read size:            4194304
> Bandwidth (MB/sec):    149.576
>
> Average Latency:       0.427844
> Max latency:           1.48576
> Min latency:           0.015371
>
>
>
>



* Re: Slow ceph fs performance
  2012-09-27 23:40     ` Mark Kirkwood
@ 2012-09-27 23:49       ` Mark Kirkwood
  2012-09-28 12:22         ` mark seger
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Kirkwood @ 2012-09-27 23:49 UTC (permalink / raw)
  To: bryan; +Cc: Bryan K. Wright, Mark Nelson, ceph-devel

Sorry Bryan - I should have read further down the thread and noted that 
you have this figured out... nothing to see here!

On 28/09/12 11:40, Mark Kirkwood wrote:
> Bryan -
>
> Note that the default block size for the rados bench is 4MB...and 
> performance decreases quite dramatically with smaller block sizes (-b 
> option to rados bench).
>



* Re: Slow ceph fs performance
  2012-09-27 23:49       ` Mark Kirkwood
@ 2012-09-28 12:22         ` mark seger
  2012-10-01 15:41           ` Bryan K. Wright
  0 siblings, 1 reply; 23+ messages in thread
From: mark seger @ 2012-09-28 12:22 UTC (permalink / raw)
  To: ceph-devel

I realize I'm a little late to this party but since collectl was mentioned 
thought I'd jump in.  ;)

Whenever I do any file system testing I also have a copy of collectl running in 
another window.
Just looking at total transfer times can end up taking you down 
the wrong path.
What if there are long stalls and very bursty I/O?  It could be a
starved resource or network issue that has nothing to do with the disks at all.
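
To illustrate why the average alone can mislead, here is a minimal
sketch with two made-up series of per-second throughput samples (the
numbers are hypothetical, not measurements from this thread). Both
average 5 MB/s, but one of them is stalled three quarters of the time:

```python
from statistics import mean, pstdev


def summarize(samples_mb_s):
    """Mean, coefficient of variation and stalled fraction of per-second samples."""
    avg = mean(samples_mb_s)
    return {
        "mean": avg,
        "cv": pstdev(samples_mb_s) / avg if avg else 0.0,
        "stalled": sum(1 for s in samples_mb_s if s == 0) / len(samples_mb_s),
    }


# Hypothetical samples, chosen so that both series have the same mean.
steady = summarize([5, 5, 5, 5, 5, 5, 5, 5])
bursty = summarize([20, 0, 0, 0, 20, 0, 0, 0])

for name, stats in (("steady", steady), ("bursty", bursty)):
    print(f"{name}: mean {stats['mean']:.1f} MB/s, "
          f"cv {stats['cv']:.2f}, stalled {stats['stalled']:.0%}")
```

The same summary can be applied to whatever per-interval numbers the
monitoring tool reports.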

As for iostat, while you're certainly welcome to use it and I based the collectl
output display format on it, I'd highly recommend using iostat -x to see
wait/service times as those can be key to seeing what's happening.

Also, if you use collectl instead with "-sD --home" you'll basically see the
output in a top-like format, making it really easy to see what's happening.
Further, if you apply the right filter you can simply watch a single disk, line by
line w/o any pesky headers in your way.

-mark


* Re: Slow ceph fs performance
  2012-09-28 12:22         ` mark seger
@ 2012-10-01 15:41           ` Bryan K. Wright
  2012-10-01 16:43             ` Mark Nelson
  0 siblings, 1 reply; 23+ messages in thread
From: Bryan K. Wright @ 2012-10-01 15:41 UTC (permalink / raw)
  To: ceph-devel

Hi again,

	I've fiddled around a lot with journal settings, so
to make sure I'm comparing apples to apples, I went back and
systematically re-ran the benchmark tests I've been running
(and some more).  A long data dump follows, but the end result 
is that it does look like something fishy is going on for small 
file sizes.  For example, performance difference between 4MB 
and 4KB files in the rados write benchmark is a factor of 25 or 
more. Here are the details, with a recap of the configuration
at the end.

	I started out by remaking the underlying xfs filesystems
on the OSD hosts, and then rerunning mkcephfs.  The journals
are 120 GB SSDs.

	First, the rsync tests again:

* Rsync of ~60 GB directory tree (mostly small files) from ceph client 
  to mounted cephfs goes at about 5.2 MB/s.

* I then turned off ceph (service ceph -a stop) and did the same 
  rsync between the same two hosts, onto the same RAID array on
  one of the OSD hosts, but using ssh this time.   This time it
  goes at about 37 MB/s.

This implies to me that the slowdown is somewhere in ceph, not in
the RAID array or the network connectivity.
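
To put those two rates in perspective, a back-of-the-envelope sketch,
assuming decimal units (1 GB = 1000 MB) and a constant rate for the
whole run; the function name is only illustrative:

```python
def transfer_seconds(total_gb, rate_mb_s):
    """Seconds needed to move total_gb gigabytes at rate_mb_s megabytes per second."""
    return total_gb * 1000 / rate_mb_s


TREE_GB = 60  # approximate size of the rsynced directory tree

cephfs_s = transfer_seconds(TREE_GB, 5.2)  # rsync onto the mounted cephfs
ssh_s = transfer_seconds(TREE_GB, 37)      # rsync over ssh onto the RAID array

print(f"cephfs: {cephfs_s / 3600:.1f} h, "
      f"ssh to RAID: {ssh_s / 60:.0f} min, "
      f"ratio: {cephfs_s / ssh_s:.1f}x")
```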

	I then remade the xfs filesystems again, re-ran mkcephfs,
restarted ceph and did some rados benchmarks.

* rados bench -p pbench 900 write -t 256 -b 4096
Total time run:         900.184096
Total writes made:      1052511
Write size:             4096
Bandwidth (MB/sec):     4.567 

Stddev Bandwidth:       4.34241
Max bandwidth (MB/sec): 23.1719
Min bandwidth (MB/sec): 0
Average Latency:        0.218949
Stddev Latency:         0.566181
Max latency:            9.92952
Min latency:            0.001449


* rados bench -p pbench 900 write -t 256 (default 4MB size)
Total time run:         900.816140
Total writes made:      25263
Write size:             4194304
Bandwidth (MB/sec):     112.178 

Stddev Bandwidth:       27.1239
Max bandwidth (MB/sec): 840
Min bandwidth (MB/sec): 0
Average Latency:        9.08281
Stddev Latency:         0.505372
Max latency:            9.31865
Min latency:            0.818949

	I repeated each of these benchmarks three times, but saw
similar results each time (a factor of 25 or more in speed between
small and large object sizes).
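
The two runs can also be compared in operations per second rather than
MB/s. The sketch below recomputes the rate from the totals and
cross-checks it with Little's law (throughput = operations in flight /
average latency), assuming "-t 256" kept 256 operations in flight for
the whole run; the dictionary layout and function names are only
illustrative:

```python
IN_FLIGHT = 256  # value passed to -t in both runs

# Figures copied from the two rados bench outputs above.
RUNS = {
    "4 KB": {"ops": 1052511, "seconds": 900.184096,
             "avg_latency": 0.218949, "mb_s": 4.567},
    "4 MB": {"ops": 25263, "seconds": 900.816140,
             "avg_latency": 9.08281, "mb_s": 112.178},
}


def ops_per_second(run):
    """Measured operation rate: total writes divided by run time."""
    return run["ops"] / run["seconds"]


def littles_law_ops(run, in_flight=IN_FLIGHT):
    """Operation rate predicted from queue depth and average latency."""
    return in_flight / run["avg_latency"]


for name, run in RUNS.items():
    print(f"{name}: measured {ops_per_second(run):.1f} ops/s, "
          f"Little's law predicts {littles_law_ops(run):.1f} ops/s")

bandwidth_ratio = RUNS["4 MB"]["mb_s"] / RUNS["4 KB"]["mb_s"]
print(f"bandwidth ratio: {bandwidth_ratio:.1f}x")
```

If the measured and predicted rates agree, the queue was kept full and
the per-operation latency is what sets the rate in each case.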

	Next, I stopped ceph and took a look at local RAID
performance as a function of file size using "iozone":

http://ayesha.phys.virginia.edu/~bryan/iozone-write-local-raid.pdf

Then I re-made the ceph filesystem and restarted ceph, and used
iozone on the ceph client to look at the mounted ceph filesystem:

http://ayesha.phys.virginia.edu/~bryan/iozone-write-cephfs.pdf

I'm not sure how to interpret the iozone performance numbers,
but the distribution certainly looks much less uniform across
different file and chunk sizes for the mounted ceph filesystem.

	Finally, I took a look at the results of bonnie++
benchmarks for I/O directly to the RAID array, or to the
mounted ceph filesystem.

* Looking at RAID array from one of the OSD hosts:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RAID on OSD  23800M  1155  99 318264  26 132959  19  2884  99 293464  20 535.4  23
Latency              7354us   30955us     129ms    8220us     119ms   62188us
Version  1.96       ------Sequential Create------ --------Random Create--------
RAID on OSD         -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 17680  58 +++++ +++ 26994  78 24715  81 +++++ +++ 26597  78
Latency               113us     105us     153us     109us      15us      94us

* Looking at the mounted ceph filesystem from the ceph client:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
cephfs, client  16G  1101  95 114623   8 45713   2  2665  98 133537   3 882.0  14
Latency             44515us   37018us    6437ms   12747us     469ms   60004us
Version  1.96       ------Sequential Create------ --------Random Create--------
cephfs, client      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   653   3 19886   9   601   3   746   3 +++++ +++   585   2
Latency              1171ms    7467us     174ms     104ms      19us     228ms

	This seems to show about a factor of 3 difference in speed between
writing to the mounted ceph filesystem and writing directly to the RAID
array.
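
The factor of 3 applies to the block write and rewrite columns; the
file create and delete columns differ by much more. A small sketch of
the ratios, with the numbers copied from the two tables above (the
dictionary keys are just labels chosen here):

```python
# bonnie++ results: K/sec for the block columns, files/sec for create/delete.
RAID = {
    "block_write": 318264, "rewrite": 132959, "block_read": 293464,
    "seq_create": 17680, "seq_delete": 26994,
    "rand_create": 24715, "rand_delete": 26597,
}
CEPHFS = {
    "block_write": 114623, "rewrite": 45713, "block_read": 133537,
    "seq_create": 653, "seq_delete": 601,
    "rand_create": 746, "rand_delete": 585,
}

# How many times faster the local RAID array is for each column.
ratios = {key: RAID[key] / CEPHFS[key] for key in RAID}

for key, ratio in ratios.items():
    print(f"{key:12s} {ratio:6.1f}x")
```

The create and delete rows are the ones that exercise metadata
operations, which is where an rsync of many small files spends much of
its time.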

	While I was doing these, I kept an eye on the OSDs and MDSs
with collectl and atop, but I didn't see anything that looked 
like an obvious problem.  The MDSs didn't see very high CPU, I/O
or memory usage, for example.

	Finally, to recap the configuration:

3 MDS hosts
4 OSD hosts, each with a RAID array for object storage and an SSD journal
xfs filesystems for the object stores
gigabit network on the front end, and a separate back end gigabit network for the ceph hosts.
64-bit CentOS 6.3 and ceph 0.48.2 everywhere
ceph servers running stock CentOS 2.6.32-279.9.1 kernel.
client running "elrepo" 3.5.4-1 kernel.

						Bryan

-- 
========================================================================
Bryan Wright              |"If you take cranberries and stew them like 
Physics Department        | applesauce, they taste much more like prunes
University of Virginia    | than rhubarb does."  --  Groucho 
Charlottesville, VA  22901|			
(434) 924-7218            |         bryan@virginia.edu
========================================================================



* Re: Slow ceph fs performance
  2012-10-01 15:41           ` Bryan K. Wright
@ 2012-10-01 16:43             ` Mark Nelson
  0 siblings, 0 replies; 23+ messages in thread
From: Mark Nelson @ 2012-10-01 16:43 UTC (permalink / raw)
  To: bryan; +Cc: Bryan K. Wright, ceph-devel

On 10/01/2012 10:41 AM, Bryan K. Wright wrote:
> Hi again,
>

Hello!

> 	I've fiddled around a lot with journal settings, so
> to make sure I'm comparing apples to apples, I went back and
> systematically re-ran the benchmark tests I've been running
> (and some more).  A long data dump follows, but the end result
> is that it does look like something fishy is going on for small
> file sizes.  For example, performance difference between 4MB
> and 4KB files in the rados write benchmark is a factor of 25 or
> more. Here are the details, with a recap of the configuration
> at the end.
>

Probably one of the most important things to think about when dealing 
with small IOs on spinning disks is how well the operating system / file 
system combine small writes into larger ones.  With spinning disks you 
get so few iops to work with that your throughput is almost entirely 
governed by seek behavior.  There are many possible reasons for slow 
performance, but this should always be something you keep in mind during 
your tests.
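
As a crude illustration of that point, the sketch below models a device
that can do a fixed number of seeks per second and never exceeds its
sequential rate. The two constants are taken from the bonnie++ results
for the RAID array quoted further down (535.4 random seeks/s, 318264
K/sec block writes, assumed to mean KiB/s); the model itself is an
assumption and ignores caching, write coalescing and journaling:

```python
MIB = 1024 * 1024

SEEKS_PER_S = 535.4        # bonnie++ random seeks/s on the RAID array
SEQ_MIB_S = 318264 / 1024  # bonnie++ block write rate, converted to MiB/s


def seek_bound_mib_s(io_bytes, seeks_per_s=SEEKS_PER_S, seq_mib_s=SEQ_MIB_S):
    """Throughput if every IO costs one seek, capped at the sequential rate."""
    return min(seeks_per_s * io_bytes / MIB, seq_mib_s)


for label, size in (("4 KiB", 4096), ("64 KiB", 65536), ("4 MiB", 4 * MIB)):
    print(f"{label}: {seek_bound_mib_s(size):.1f} MiB/s")
```

With one seek per 4 KiB write the model lands at about 2 MiB/s, which
is why merging small writes into larger ones matters so much on
spinning disks.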

> 	I started out by remaking the underlying xfs filesystems
> on the OSD hosts, and then rerunning mkcephfs.  The journals
> are 120 GB SSDs.
>
> 	First, the rsync tests again:
>
> * Rsync of ~60 GB directory tree (mostly small files) from ceph client
>    to mounted cephfs goes at about 5.2 MB/s.
>

When you were doing this, what kind of results did collectl give you for 
average write sizes to the underlying OSD disks?

> * I then turned off ceph (service ceph -a stop) and did the same
>    rsync between the same two hosts, onto the same RAID array on
>    one of the OSD hosts, but using ssh this time.   This time it
>    goes at about 37 MB/s.
>
> This implies to me that the slowdown is somewhere in ceph, not in
> the RAID array or the network connectivity.
>

There are potentially multiple issues here.  Part of it might be how 
writes are coalesced by XFS in each scenario.  Part of it might also be 
overhead due to XFS metadata reads/writes.  You could probably get a 
better idea of both of these by running blktrace during the tests and 
making seekwatcher movies of the results.  You not only can look at the 
numbers of seeks, but also the kind (read/writes) and where on the disk 
they are going.  That, and some of the raw blktrace data can give you a 
lot of information about what is going on and whether or not seeks are 
related to metadata.

Beyond that, I do think you are correct in suspecting that there are 
some Ceph limitations as well.  Some things that may be interesting to try:

- 1 OSD per Disk
- Multiple OSDs on the RAID array.
- Increasing various thread counts
- Increasing various op and byte limits (such as 
journal_max_write_entries and journal_max_write_bytes).
- EXT4 or BTRFS under the OSDs.

> 	I then remade the xfs filesystems again, re-ran mkcephfs,
> restarted ceph and did some rados benchmarks.
>
> * rados bench -p pbench 900 write -t 256 -b 4096
> Total time run:         900.184096
> Total writes made:      1052511
> Write size:             4096
> Bandwidth (MB/sec):     4.567
>
> Stddev Bandwidth:       4.34241
> Max bandwidth (MB/sec): 23.1719
> Min bandwidth (MB/sec): 0
> Average Latency:        0.218949
> Stddev Latency:         0.566181
> Max latency:            9.92952
> Min latency:            0.001449
>

XFS does pretty poorly with RADOS bench at small IO sizes from what I've 
seen.  EXT4 and BTRFS tend to do better, but probably not more than 2-3 
times better.

>
> * rados bench -p pbench 900 write -t 256 (default 4MB size)
> Total time run:         900.816140
> Total writes made:      25263
> Write size:             4194304
> Bandwidth (MB/sec):     112.178
>
> Stddev Bandwidth:       27.1239
> Max bandwidth (MB/sec): 840
> Min bandwidth (MB/sec): 0
> Average Latency:        9.08281
> Stddev Latency:         0.505372
> Max latency:            9.31865
> Min latency:            0.818949
>

I imagine your Max throughput for 4MB IOs is being limited by the 
network here.  You may be able to get higher aggregate performance by 
running rados bench on multiple clients concurrently.
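
A quick sketch of that estimate, assuming a single 1 Gbit/s link with
no bonding and treating rados bench's "MB" as 2^20 bytes (the constant
and function names are only illustrative):

```python
MIB = 1024 * 1024
LINK_BITS_PER_S = 1_000_000_000  # assumed: one gigabit link, no bonding


def link_utilization(mib_per_s, link_bits_per_s=LINK_BITS_PER_S):
    """Fraction of the raw link rate used by a payload rate given in MiB/s."""
    return mib_per_s * MIB * 8 / link_bits_per_s


util = link_utilization(112.178)  # average bandwidth of the 4MB run
print(f"{util:.1%} of a gigabit link")
```

That is payload only; Ethernet, IP and TCP headers take up part of the
remainder, so the link has little headroom left.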

> 	I repeated each of these benchmarks three times, but saw
> similar results each time (a factor of 25 or more in speed between
> small and large object sizes).
>
> 	Next, I stopped ceph and took a look at local RAID
> performance as a function of file size using "iozone":
>
> http://ayesha.phys.virginia.edu/~bryan/iozone-write-local-raid.pdf
>
> Then I re-made the ceph filesystem and restarted ceph, and used
> iozone on the ceph client to look at the mounted ceph filesystem:
>
> http://ayesha.phys.virginia.edu/~bryan/iozone-write-cephfs.pdf
>

Do you happen to have the settings you used when you ran these tests?  I 
probably don't have time to try to repeat them now, but I can at least 
take a quick look at them.

> I'm not sure how to interpret the iozone performance numbers,
> but the distribution certainly looks much less uniform across
> different file and chunk sizes for the mounted ceph filesystem.
>

Indeed.  Some of that is to be expected just because of the increased 
complexity and number of ways that things can get backed up in a 
distributed system like Ceph.  Having said that, the trench in the 
middle of the Ceph distribution is interesting.  I wouldn't mind digging 
into that more.

I'm slightly confused by the labels on the graph.  They can't possibly 
mean that 2^16384 KB record sizes were tested.  Was that just up to 16MB 
records and 16GB files?  That would make a lot more sense.
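
For what it's worth, the arithmetic behind that guess, as a small
sketch. It assumes the axis value 16384 is a size in KB rather than an
exponent; that reading is a guess about the plot, not something
confirmed in this thread:

```python
import math

LABEL_KB = 16384  # value shown on the record-size axis

exponent = math.log2(LABEL_KB)
print(f"{LABEL_KB} KB = 2^{exponent:.0f} KB = {LABEL_KB // 1024} MB")

# Read literally as an exponent, 2^16384 KB would be a number with
# thousands of decimal digits, which cannot be a record size.
digits = int(LABEL_KB * math.log10(2)) + 1
print(f"2^{LABEL_KB} has {digits} decimal digits")
```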

> 	Finally, I took a look at the results of bonnie++
> benchmarks for I/O directly to the RAID array, or to the
> mounted ceph filesystem.
>
> * Looking at RAID array from one of the OSD hosts:
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> RAID on OSD  23800M  1155  99 318264  26 132959  19  2884  99 293464  20 535.4  23
> Latency              7354us   30955us     129ms    8220us     119ms   62188us
> Version  1.96       ------Sequential Create------ --------Random Create--------
> RAID on OSD         -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>                files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                   16 17680  58 +++++ +++ 26994  78 24715  81 +++++ +++ 26597  78
> Latency               113us     105us     153us     109us      15us      94us
>
> * Looking at the mounted ceph filesystem from the ceph client:
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> cephfs, client  16G  1101  95 114623   8 45713   2  2665  98 133537   3 882.0  14
> Latency             44515us   37018us    6437ms   12747us     469ms   60004us
> Version  1.96       ------Sequential Create------ --------Random Create--------
> cephfs, client      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>                files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                   16   653   3 19886   9   601   3   746   3 +++++ +++   585   2
> Latency              1171ms    7467us     174ms     104ms      19us     228ms
>
> 	This seems to show about a factor of 3 difference in speed between
> writing to the mounted ceph filesystem and writing directly to the RAID
> array.

This might be a dumb question, but was the ceph version of this test on 
a single client on gigabit Ethernet?  If so, wouldn't that be the reason 
you are maxing out at like 114MB/s?

>
> 	While I was doing these, I kept an eye on the OSDs and MDSs
> with collectl and atop, but I didn't see anything that looked
> like an obvious problem.  The MDSs didn't see very high CPU, I/O
> or memory usage, for example.
>
> 	Finally, to recap the configuration:
>
> 3 MDS hosts
> 4 OSD hosts, each with a RAID array for object storage and an SSD journal
> xfs filesystems for the object stores
> gigabit network on the front end, and a separate back end gigabit network for the ceph hosts.
> 64-bit CentOS 6.3 and ceph 0.48.2 everywhere
> ceph servers running stock CentOS 2.6.32-279.9.1 kernel.
> client running "elrepo" 3.5.4-1 kernel.
>
> 						Bryan
>

Mark


end of thread, other threads:[~2012-11-03 17:55 UTC | newest]

Thread overview: 23+ messages
2012-09-26 14:50 Slow ceph fs performance Bryan K. Wright
2012-09-26 15:26 ` Mark Nelson
2012-09-26 20:54   ` Bryan K. Wright
2012-09-27 15:16     ` Bryan K. Wright
2012-09-27 18:04     ` Gregory Farnum
2012-09-27 18:47       ` Bryan K. Wright
2012-09-27 19:47         ` Gregory Farnum
2012-10-01 16:47       ` Tommi Virtanen
2012-10-01 17:00         ` Gregory Farnum
2012-10-03 14:55           ` Bryan K. Wright
2012-10-03 18:35             ` Gregory Farnum
2012-10-04 13:14               ` Bryan K. Wright
2012-10-04 15:24                 ` Sage Weil
2012-10-04 15:54                   ` Bryan K. Wright
2012-10-26 20:48                     ` Gregory Farnum
2012-10-29 15:08                       ` Bryan K. Wright
2012-11-03 17:55                         ` Gregory Farnum
2012-10-01 17:03         ` Mark Nelson
2012-09-27 23:40     ` Mark Kirkwood
2012-09-27 23:49       ` Mark Kirkwood
2012-09-28 12:22         ` mark seger
2012-10-01 15:41           ` Bryan K. Wright
2012-10-01 16:43             ` Mark Nelson
