poor OSD performance using kernel 3.4

* poor OSD performance using kernel 3.4
@ 2012-05-24 14:10 Stefan Priebe - Profihost AG
  2012-05-24 14:57 ` Mark Nelson
                   ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-24 14:10 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi list,

Today, while testing btrfs, I discovered very poor OSD performance with
kernel 3.4.

The underlying FS is XFS, but the result is the same with btrfs.

3.0.30:
~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        41        25   99.9767       100  0.586984  0.447293
    2      16        71        55   109.979       120  0.934388  0.488375
    3      16        99        83   110.647       112   1.15982  0.503111
    4      16       130       114   113.981       124   1.05952  0.516925
    5      16       159       143   114.382       116  0.149313  0.510734
    6      16       188       172   114.649       116  0.287166   0.52203
    7      16       215       199   113.697       108  0.151784  0.531461
    8      16       242       226   112.984       108  0.623478  0.539896
    9      16       265       249   110.651        92   0.50354  0.538504
   10      16       296       280   111.984       124  0.155048  0.542846
Total time run:        10.776153
Total writes made:     297
Write size:            4194304
Bandwidth (MB/sec):    110.243

Average Latency:       0.577534
Max latency:           1.85499
Min latency:           0.091473


3.4:
~# rados -p data bench 10 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        40        24   95.9794        96  0.393196  0.455936
    2      16        68        52   103.983       112  0.835652  0.517297
    3      16        85        69   91.9849        68   1.00535  0.493058
    4      16        96        80   79.9869        44  0.096564  0.577948
    5      16       103        87   69.5879        28  0.092722  0.589147
    6      16       117       101   67.3216        56  0.222175  0.675334
    7      16       130       114   65.1321        52   0.15677  0.623806
    8      16       144       128   63.9896        56  0.089157   0.56746
    9      16       144       128   56.8794         0         -   0.56746
   10      16       144       128   51.1912         0         -   0.56746
   11      16       144       128   46.5373         0         -   0.56746
   12      16       144       128   42.6591         0         -   0.56746
   13      16       144       128   39.3776         0         -   0.56746
   14      16       144       128   36.5649         0         -   0.56746
   15      16       144       128   34.1272         0         -   0.56746
   16      16       145       129   32.2443       0.5   11.3422  0.650985
Total time run:        16.193871
Total writes made:     145
Write size:            4194304
Bandwidth (MB/sec):    35.816

Average Latency:       1.78467
Max latency:           14.4744
Min latency:           0.088753
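
The summary bandwidth reported by rados bench can be cross-checked from the other summary lines. The following is a minimal sketch, assuming the tool counts 1 MB as 2**20 bytes (an assumption inferred from the numbers, not taken from the rados source); the helper name is made up for illustration.

```python
MIB = 1024 * 1024  # assumption: rados bench reports MB as 2**20 bytes


def bench_bandwidth_mb_s(writes_made, write_size_bytes, total_time_s):
    """Average bandwidth in MB/s from the rados bench summary lines."""
    return writes_made * write_size_bytes / MIB / total_time_s


# (total writes made, write size in bytes, total time run in seconds)
runs = {
    "3.0.30": (297, 4194304, 10.776153),
    "3.4": (145, 4194304, 16.193871),
}

for kernel, (writes, size, seconds) in runs.items():
    print(f"{kernel}: {bench_bandwidth_mb_s(writes, size, seconds):.3f} MB/s")
```

Both results agree with the reported 110.243 and 35.816 MB/s, so the 3.4 run really completed about half as many writes in about 1.5 times the wall-clock time.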

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-24 14:10 poor OSD performance using kernel 3.4 Stefan Priebe - Profihost AG
@ 2012-05-24 14:57 ` Mark Nelson
       [not found] ` <CAJCPpW+SKnnVUaDEAsCkKyZwMVrHCRJF2C8zqB4eORgwW5p=1Q@mail.gmail.com>
  2012-05-29 22:25 ` poor OSD performance using kernel 3.4 Mark Nelson
  2 siblings, 0 replies; 73+ messages in thread
From: Mark Nelson @ 2012-05-24 14:57 UTC (permalink / raw)
  Cc: ceph-devel@vger.kernel.org

Hi Stefan,

Were these both tested on fresh filesystems?  If you still have any 
3.0.30 available, could you try a couple of longer running tests (say 5 
minutes) and see how they compare?

Thanks,
Mark

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
       [not found]   ` <4FBE7ABC.5020502@profihost.ag>
@ 2012-05-24 18:53     ` Mark Nelson
  2012-05-24 19:05       ` Stefan Priebe
  0 siblings, 1 reply; 73+ messages in thread
From: Mark Nelson @ 2012-05-24 18:53 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel@vger.kernel.org

Hi Stefan,

Thanks for the info!  I've been testing on 3.4 for the last couple of 
days but haven't run into that problem here.  It looks like your journal 
has writes going to it quickly and then things stall as it tries to 
write out to your data disk.  I wonder if any of the data actually makes 
it to the disk...  Can you run iostat or collectl or something and see 
what kind of write throughput you get to the OSD data disks?

Thanks,
Mark

On 05/24/2012 01:15 PM, Stefan Priebe wrote:
>
> On 24.05.2012 16:55, Mark Nelson wrote:
>> Hi Stefan,
>>
>> Were these both tested on fresh filesystems?  If you still have any
>> 3.0.30 available, could you try a couple of longer running tests (say 5
>> minutes) and see how they compare?
>
> Yes, with 3.4 it stalls completely. Tested with XFS and btrfs. The client
> always ran the same kernel, so I changed the kernel only on the OSD side.
>
> Kernel 3.4
> http://pastebin.com/raw.php?i=CApKbSNj
>
> Kernel 3.0.30
> http://pastebin.com/raw.php?i=kZ7rnwcM
>
> Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-24 18:53     ` Mark Nelson
@ 2012-05-24 19:05       ` Stefan Priebe
  2012-05-25  1:53         ` Mark Nelson
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe @ 2012-05-24 19:05 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel@vger.kernel.org

On 24.05.2012 20:53, Mark Nelson wrote:
> Hi Stefan,
>
> Thanks for the info! I've been testing on 3.4 for the last couple of
> days but haven't run into that problem here. It looks like your journal
> has writes going to it quickly and then things stall as it tries to
> write out to your data disk.
That's a good point. Right now, while testing, I'm using a tmpfs ramdisk
for the journal and have set journal dio = false in ceph.conf. Might
this be the difference / problem?

3.2.18 works fine too.

> I wonder if any of the data actually makes
> it to the disk... Can you run iostat or collectl or something and see
> what kind of write throughput you get to the OSD data disks?
None... so it seems the data never gets transferred from the journal to disk.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-24 19:05       ` Stefan Priebe
@ 2012-05-25  1:53         ` Mark Nelson
  2012-05-25  8:19           ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 73+ messages in thread
From: Mark Nelson @ 2012-05-25  1:53 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel@vger.kernel.org

On 05/24/2012 02:05 PM, Stefan Priebe wrote:
> On 24.05.2012 20:53, Mark Nelson wrote:
>> Hi Stefan,
>>
>> Thanks for the info! I've been testing on 3.4 for the last couple of
>> days but haven't run into that problem here. It looks like your journal
>> has writes going to it quickly and then things stall as it tries to
>> write out to your data disk.
> That's a good point. Right now while testing i'm using a tmpfs ramdisk 
> for the journal and have set journal dio = false in ceph.conf? Might 
> this be the difference / problem?
>
> 3.2.18 works fine too.

Honestly I don't know if tmpfs journal with dio = false would lead to 
that kind of behavior.  Anything interesting in the logs if you turn 
debugging up?

>
>> I wonder if any of the data actually makes
>> it to the disk... Can you run iostat or collectl or something and see
>> what kind of write throughput you get to the OSD data disks?
> none... so it seems get's never transferred from journal to disk.

This might be a stupid question, but writes to those partitions work
outside of Ceph with the new kernel, right?

>
> Stefan

Thanks,
Mark

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-25  1:53         ` Mark Nelson
@ 2012-05-25  8:19           ` Stefan Priebe - Profihost AG
  2012-05-25 11:31             ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-25  8:19 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel@vger.kernel.org

On 25.05.2012 03:53, Mark Nelson wrote:
> On 05/24/2012 02:05 PM, Stefan Priebe wrote:
>> 3.2.18 works fine too.
> 
> Honestly I don't know if tmpfs journal with dio = false would lead to
> that kind of behavior.  Anything interesting in the logs if you turn
> debugging up?

Just stuff like this. But writing to the OSD disk works; I have no idea why I
saw a rate of 0 yesterday.

[INF] 2.2a scrub ok
2012-05-25 10:01:00.825442    pg v165: 768 pgs: 768 active+clean; 592 MB
data, 1181 MB used, 669 GB / 670 GB avail

2012-05-25 10:01:00.623252 osd.0 10.0.255.100:6800/7423 121 : [WRN] 1
slow requests, 1 included below; oldest blocked for > 30.042783 secs

2012-05-25 10:01:00.623259 osd.0 10.0.255.100:6800/7423 122 : [WRN] slow
request 30.042783 seconds old, received at 2012-05-25 10:00:30.580392:
osd_op(client.4111.0:74 proxmox1_154826_object73 [write 0~4194304]
0.5343bcc6) v4 currently waiting for sub ops


>>> I wonder if any of the data actually makes
>>> it to the disk... Can you run iostat or collectl or something and see
>>> what kind of write throughput you get to the OSD data disks?
>> none... so it seems get's never transferred from journal to disk.
> 
> This might be a stupid question, but writes to those partitions work
> outside of Ceph with the new kernel right?

I just tested with dd:
dd if=/dev/zero of=/srv/test bs=1M count=10000 oflag=direct

This gave me a constant rate of 240MB/s on ALL OSDs.

A "ceph osd tell X bench" also shows 260MB/s on all OSDs.

But when I use rados bench, I see the same behaviour with XFS and btrfs: the
cur MB/s rate swings heavily up and down during the run.
See:
XFS:
http://pastebin.com/raw.php?i=8ahaePZw
btrfs:
http://pastebin.com/raw.php?i=BrwSC1yg

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-25  8:19           ` Stefan Priebe - Profihost AG
@ 2012-05-25 11:31             ` Stefan Priebe - Profihost AG
  2012-05-25 12:10               ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-25 11:31 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel@vger.kernel.org


Here are some speed tests with different kernel versions. The same applies to
other filesystems such as btrfs.
I used "rados -p data bench 100 write -t 16" on a freshly created FS for
every test. Mount options were always: noatime,nodiratime,nobarrier.

3.0.30 with XFS

speed is always between 120 and 160MB/s

Total time run:        100.510061
Total writes made:     3605
Write size:            4194304
Bandwidth (MB/sec):    143.468

Average Latency:       0.445714
Max latency:           1.99929
Min latency:           0.084812

3.2.18 with XFS

speed is between 40 and 170MB/s

Total time run:        100.795653
Total writes made:     3384
Write size:            4194304
Bandwidth (MB/sec):    134.292

Average Latency:       0.476297
Max latency:           2.92075
Min latency:           0.084884

3.3.7 with XFS

!! speed heavily jumps between 0 and 170 MB/s !!

Total time run:        107.398166
Total writes made:     2455
Write size:            4194304
Bandwidth (MB/sec):    91.435

Average Latency:       0.699819
Max latency:           13.8117
Min latency:           0.084624

3.4 with XFS

!! speed heavily jumps between 0 and 130 MB/s - most of the time it's
near 0 !!

Total time run:        115.433531
Total writes made:     468
Write size:            4194304
Bandwidth (MB/sec):    16.217

Average Latency:       3.9452
Max latency:           53.4356
Min latency:           0.091276
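
The bandwidth and average latency figures above are tied together: with 16 writes always in flight, throughput should be roughly concurrency times object size divided by average latency (Little's law). The sketch below checks that relation against the four summaries; the helper name is made up, and the relation is only an approximation, not something rados bench itself reports.

```python
CONCURRENT_OPS = 16            # the "-t 16" argument
OBJECT_MB = 4194304 / 2**20    # 4 MB per write

# kernel: (measured bandwidth in MB/s, average latency in seconds)
measured = {
    "3.0.30": (143.468, 0.445714),
    "3.2.18": (134.292, 0.476297),
    "3.3.7": (91.435, 0.699819),
    "3.4": (16.217, 3.9452),
}


def predicted_bandwidth(avg_latency_s, ops=CONCURRENT_OPS, object_mb=OBJECT_MB):
    """Little's law estimate: ops in flight * size per op / time per op."""
    return ops * object_mb / avg_latency_s


for kernel, (bandwidth, latency) in measured.items():
    print(f"{kernel:>7}: measured {bandwidth:8.3f} MB/s, "
          f"predicted {predicted_bandwidth(latency):8.3f} MB/s")
```

The prediction lands within 1% of the measurement for every kernel, which suggests the bandwidth collapse on 3.4 is fully accounted for by individual writes taking much longer to complete, not by fewer writes being issued.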


Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-25 11:31             ` Stefan Priebe - Profihost AG
@ 2012-05-25 12:10               ` Stefan Priebe - Profihost AG
  2012-05-25 15:47                 ` Alexandre DERUMIER
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-25 12:10 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel@vger.kernel.org

Even with v3.3-rc1 the speed is pretty often 0.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-25 12:10               ` Stefan Priebe - Profihost AG
@ 2012-05-25 15:47                 ` Alexandre DERUMIER
  2012-05-27  9:11                   ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 73+ messages in thread
From: Alexandre DERUMIER @ 2012-05-25 15:47 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel, Mark Nelson

Hi Stefan,
Do you see the same performance with reads?

Have you run iostat?
How long does it take to flush from the journal to the disks?


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-25 15:47                 ` Alexandre DERUMIER
@ 2012-05-27  9:11                   ` Stefan Priebe - Profihost AG
  2012-05-27 11:33                     ` Alexandre DERUMIER
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-27  9:11 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, Mark Nelson

Can nobody help with this?

On 25.05.2012 17:47, Alexandre DERUMIER wrote:
> Hi Stephan,
> Do you have same performance with read ?

Read is fine for both versions; see here:

3.0.30

Write:
Total time run:        30.872357
Total writes made:     1095
Write size:            4194304
Bandwidth (MB/sec):    141.874

Average Latency:       0.450187
Max latency:           2.00672
Min latency:           0.091783

Read:
Total time run:        22.907021
Total reads made:     1095
Read size:            4194304
Bandwidth (MB/sec):    191.208

Average Latency:       0.333954
Max latency:           1.71987
Min latency:           0.041373

3.4.0

Write:
Total time run:        124.573247
Total writes made:     647
Write size:            4194304
Bandwidth (MB/sec):    20.775

Average Latency:       3.08058
Max latency:           65.2522
Min latency:           0.089587

Read:
Total time run:        13.191562
Total reads made:     647
Read size:            4194304
Bandwidth (MB/sec):    196.186

Average Latency:       0.322895
Max latency:           1.22392
Min latency:           0.043784


> Did you have done some iostats ?
Yes - I/O jumps heavily between 0 and 60MB/s, but most of the time it's 0
or around 10MB/s.

> how much time to flush from journal to disks ?
I don't know how to measure this, because ceph starts writing to the journal
and the disk in parallel, and tmpfs isn't even shown in iostat.

Greets,
Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-27  9:11                   ` Stefan Priebe - Profihost AG
@ 2012-05-27 11:33                     ` Alexandre DERUMIER
  2012-05-27 18:57                       ` Stefan Priebe
  0 siblings, 1 reply; 73+ messages in thread
From: Alexandre DERUMIER @ 2012-05-27 11:33 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel, Mark Nelson

>> how much time to flush from journal to disks ?
> I don't know how to measure this.
Run iostat: you should see a period of write inactivity on the data disk
(while data is being written to the journal), followed by a period of write
activity (while data is flushed from the journal to disk).
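
On Linux, the same idle and busy phases can also be observed by sampling the per-device counters in /proc/diskstats, which is where iostat gets its numbers. The following is a minimal sketch; the device name "sdb" and the two sample lines are made up for illustration, and the function names are hypothetical.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def sectors_written(diskstats_text, device):
    """Return the cumulative 'sectors written' counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major, minor, device name, then the per-device counters;
        # index 9 holds the number of sectors written.
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9])
    raise KeyError(f"device {device!r} not found")


def write_mb_per_s(before_text, after_text, device, interval_s):
    """Write throughput between two /proc/diskstats snapshots."""
    delta = sectors_written(after_text, device) - sectors_written(before_text, device)
    return delta * SECTOR_BYTES / 2**20 / interval_s


# Made-up snapshots taken one second apart for a hypothetical disk "sdb".
before = "   8      16 sdb 1200 0 9600 300 5000 0 4096 900 0 700 1200"
after = "   8      16 sdb 1200 0 9600 300 5400 0 208896 1500 0 1300 1800"
print(write_mb_per_s(before, after, "sdb", 1.0))
```

On a live system the two snapshots would come from reading /proc/diskstats twice with a sleep in between; a run of intervals near 0 MB/s followed by a burst would correspond to the journal-then-flush pattern described above.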

> As ceph starts to write to journal and
> disk in parallel

This is strange. From the docs:
http://ceph.com/wiki/OSD_journal

the journal mode should be write-ahead with XFS.
So data is written to the journal first and then flushed to disk every 30 seconds.

Maybe your tmpfs is too small, and a flush is triggered when the journal is down to 50% free space.
If, for example, a flush occurs every 1 or 2 seconds, this can cause very slow writes.



> and tmpfs isn't even shown in iostat.
Indeed, iostat doesn't work with tmpfs...



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-27 11:33                     ` Alexandre DERUMIER
@ 2012-05-27 18:57                       ` Stefan Priebe
  2012-05-28  5:37                         ` Alexandre DERUMIER
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe @ 2012-05-27 18:57 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, Mark Nelson

On 27.05.2012 13:33, Alexandre DERUMIER wrote:
>> how much time to flush from journal to disks ?
>>> I don't know how to measure this.
> Do an iostat, you must see timelapse of write inactivity on disk (datas are written to journal) , then after a timelapse
> of write activity on disk.(data flushed from journal to disk)
No, both always start in parallel. The journal is set to 1GB. I've now moved
the journal to disk, so I can use iostat.

>>> As ceph starts to write to journal and
>>> disk in parallel
>
> this is strange, from doc:
> http://ceph.com/wiki/OSD_journal
>
> the journal mode should be write-ahead with xfs.
> So write to journal first then flush to disk each 30sec.
I'm not so sure, because
http://ceph.com/wiki/Ceph.conf#filestore_journal_writeahead

says there are two options:
filestore journal writeahead
and
filestore journal parallel
but even setting
filestore journal writeahead = 1
filestore journal parallel = 0

results in a parallel start.

> maybe your tmpfs is too small, and flushs occurs at 50% of free space on journal.
> If by exemple, your flush occurs each 1 or 2seconds, this can cause very slow write.
1GB? My 1Gbit/s LAN test connection can't handle more than about 
120MB/s. So there's at least room for 8-10s.
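
The estimate above can be reproduced with simple arithmetic. This is a minimal sketch, assuming 1GB means 1024 MB and that the journal is filled at the full 120MB/s link rate with nothing drained in the meantime; the function name is made up for illustration.

```python
def seconds_until_fraction_full(journal_mb, ingest_mb_s, fraction=1.0):
    """Time to fill the given fraction of an initially empty journal."""
    return journal_mb * fraction / ingest_mb_s


JOURNAL_MB = 1024  # assumption: 1GB journal counted as 1024 MB
LINK_MB_S = 120    # practical ceiling of the 1Gbit/s test link

print(f"full: {seconds_until_fraction_full(JOURNAL_MB, LINK_MB_S):.1f} s")
print(f"half: {seconds_until_fraction_full(JOURNAL_MB, LINK_MB_S, 0.5):.1f} s")
```

This gives roughly 8.5 seconds to fill the whole journal and a little over 4 seconds to reach the 50% mark mentioned in the quoted text.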

;-(

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-27 18:57                       ` Stefan Priebe
@ 2012-05-28  5:37                         ` Alexandre DERUMIER
  2012-05-28  6:25                           ` Stefan Priebe
  0 siblings, 1 reply; 73+ messages in thread
From: Alexandre DERUMIER @ 2012-05-28  5:37 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel, Mark Nelson

I think filestore journal parallel works only with btrfs.
Other filesystems use writeahead.

If you write at 120MB/s, your 1GB journal is 50% full after about 4 seconds.

So you get around 480MB every 4 seconds. Can your disks flush these 480MB sequentially in less than 4 seconds?
(Run a small benchmark of your disk on a local filesystem, without Ceph.)

If not, you can get spikes in your write stats if the journal fills up.

Simple schema if the disks are not fast enough:

0-4sec
------
random write (first wave 480MB) --->journal

4-8sec
------
random write (second wave)---->journal---->write flush of first wave(480MB) --->disks

8-12sec
-------
random write (third wave) blocked ---->journal---->write of second wave blocked---->write flush of first wave not yet finished(480MB) --->disks



Good schema
-----------
0-4sec
------
random write (first wave 480MB) --->journal

4-8sec
------
random write (second wave)---->journal---->write flush of first wave(480MB) --->disks

8-12sec
-------
random write (third wave)---->journal---->write of second wave(480MB) --->disks



So, with a bigger journal, you have more data to write to the disks, so more data can be written sequentially in one flush.
4 seconds seems very short; you need 20-30 seconds between flushes.
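
The two schemas above can be approximated with a toy model of a journal that is filled by clients and drained by the data disk. This is only a sketch under simplifying assumptions (continuous draining in one-second steps, no 50% flush threshold, made-up disk rates of 60 and 240 MB/s); it is not how Ceph's filestore actually schedules flushes.

```python
def simulate_journal(capacity_mb, ingest_mb_s, drain_mb_s, seconds):
    """Return the MB accepted from clients in each one-second step."""
    level = 0.0  # MB currently held in the journal
    accepted_per_second = []
    for _ in range(seconds):
        # The data disk drains whatever it can from the journal.
        level -= min(level, drain_mb_s)
        # Clients can only write into the space that is free.
        accepted = min(ingest_mb_s, capacity_mb - level)
        level += accepted
        accepted_per_second.append(accepted)
    return accepted_per_second


slow_disk = simulate_journal(1024, 120, 60, 30)   # disk slower than the clients
fast_disk = simulate_journal(1024, 120, 240, 30)  # disk faster than the clients

print("slow disk:", slow_disk[0], "MB/s at start,", slow_disk[-1], "MB/s at end")
print("fast disk:", fast_disk[0], "MB/s at start,", fast_disk[-1], "MB/s at end")
```

In this simplified model a slow disk makes client throughput fall from the ingest rate to the disk rate once the journal is full, but never to zero; sustained periods at 0 MB/s would require the drain itself to stall.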

How many disks (7.2K) do you have per OSD?





^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-28  5:37                         ` Alexandre DERUMIER
@ 2012-05-28  6:25                           ` Stefan Priebe
  2012-05-28  6:52                             ` Alexandre DERUMIER
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe @ 2012-05-28  6:25 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, Mark Nelson

On 28.05.2012 07:37, Alexandre DERUMIER wrote:
> I think filestore journal parallel works only with btrfs.
> Other filesystem are writeahead.
... you might be right, but I can't change Ceph's implementation.

> If you write at 120MB/s, your 1GB journal is 50% full in about 4 seconds.
>
> So you get around 480MB every 4 seconds; can your disks flush these 480MB sequentially in less than 4 seconds?
> (Do a small benchmark of your disk on a local filesystem, without Ceph.)
>
> If not, you can have spikes in your write stats if the journal.
>
> simple schema if disks are not fast enough:
I totally agree with you, but this is just a test setup. And if you have 
a big log file to copy, let's say 100GB, your journal will never be big 
enough, and the speed should still never drop to 0MB/s. Also, I see the correct 
behaviour with 3.0.X, where the speed is capped at that of the underlying device. 
So I still see no reason why, with 3.4, the speed drops to 0MB/s and is 
mostly 10-20MB/s instead of 130MB/s.
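
For reference, the journal arithmetic quoted above can be sketched in a few lines of Python. This is only an illustration: the 50% flush threshold is the figure assumed in this thread, not a verified Ceph default, and the function name is made up for the example.

```python
def seconds_until_flush(journal_mb, write_mb_s, flush_fraction=0.5):
    """Time for incoming writes to fill the journal up to the flush threshold.

    flush_fraction=0.5 is the threshold assumed in this thread, not a
    confirmed Ceph setting.
    """
    return journal_mb * flush_fraction / write_mb_s


# 1GB journal written at about 120MB/s (the 1Gbit/s LAN limit mentioned
# in the thread): a flush would be triggered roughly every 4 seconds, so
# the backing disk has to absorb about 512MB in that window to keep up.
interval = seconds_until_flush(1024, 120)
print(f"flush every {interval:.1f}s")
```

If the disk can sustain more than the incoming write rate, the journal should never fill completely under this model, which is why a drop to 0MB/s is unexpected.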

> How many disks (7,2K) do you have by osd ?
One intel 520 SSD per OSD.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-28  6:25                           ` Stefan Priebe
@ 2012-05-28  6:52                             ` Alexandre DERUMIER
  2012-05-28 19:48                               ` Stefan Priebe
  0 siblings, 1 reply; 73+ messages in thread
From: Alexandre DERUMIER @ 2012-05-28  6:52 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel, Mark Nelson

> I think filestore journal parallel works only with btrfs. 
> Other filesystem are writeahead. 
>>... you might be right but i can't change ceph's implementation. 

See my schema:
I think you see parallel writes because the first wave is being flushed to disk at the same time 
as the second wave is being written to the journal.


>>I totally agree with you, but this is just a test setup. And if you have 
>>a big log file to copy, let's say 100GB, your journal will never be big 
>>enough, and the speed should still never drop to 0MB/s. Also, I see the correct 
>>behaviour with 3.0.X, where the speed is capped at that of the underlying device. 
>>So I still see no reason why, with 3.4, the speed drops to 0MB/s and is 
>>mostly 10-20MB/s instead of 130MB/s. 

Maybe something is wrong with 3.4 that makes your disks write more slowly (XFS bug, SATA controller driver bug, ...).
In my schema:
slowly enough for the third wave to block on the journal (hence 0MB/s).

Maybe a local benchmark of your SSD with 3.4 can give some hints?

>> How many disks (7,2K) do you have by osd ? 
>>>One intel 520 SSD per OSD. 

I have seen benchmarks on the internet showing about 150-300MB/s (depending on the block size).

Something must be wrong; I think doing a local benchmark can really help.
You can use sysbench-tools:
https://github.com/tsuna/sysbench-tools
It produces benchmark comparisons with nice graphs.




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-28  6:52                             ` Alexandre DERUMIER
@ 2012-05-28 19:48                               ` Stefan Priebe
  2012-05-29  3:54                                 ` Alexandre DERUMIER
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe @ 2012-05-28 19:48 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, Mark Nelson

On 28.05.2012 08:52, Alexandre DERUMIER wrote:
>> I think filestore journal parallel works only with btrfs.
>> Other filesystem are writeahead.
>>> ... you might be right but i can't change ceph's implementation.
>
> See my schema,
> I think you see parallel writes, because you see flush write of first wave to disk, in the same time
> of second wave write to journal.
Yes, I fully understand and agree, but this should still result in at least 
a constant bandwidth near the maximum of the underlying disk.

>>> I totally agree with you, but this is just a test setup. And if you have
>>> a big log file to copy, let's say 100GB, your journal will never be big
>>> enough, and the speed should still never drop to 0MB/s. Also, I see the correct
>>> behaviour with 3.0.X, where the speed is capped at that of the underlying device.
>>> So I still see no reason why, with 3.4, the speed drops to 0MB/s and is
>>> mostly 10-20MB/s instead of 130MB/s.
>
> Maybe something is wrong with 3.4, then your disk write more slowly. (xfs bug, sata driver controller bug, ...)

This happens with ext4 or btrfs too.

Sequential write speed to the FS is practically the same under 3.0 and 3.4 using 
oflag=direct.

3.4:
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 41,4899 s, 253 MB/s

3.0:
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 40,861 s, 257 MB/s
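
As a cross-check of the two dd results above, the throughput can be recomputed from the byte count and elapsed time. The sketch below is only an illustration: the function name is made up, and it assumes the comma in the elapsed time is a locale decimal separator (as in the output above) and that dd reports decimal MB (10^6 bytes).

```python
import re


def dd_throughput_mb_s(summary_line):
    """Return decimal MB/s computed from a dd summary line."""
    m = re.search(r"(\d+) bytes.*copied, ([0-9.,]+) s", summary_line)
    if m is None:
        raise ValueError("not a dd summary line")
    nbytes = int(m.group(1))
    # Assumption: the comma is a decimal separator, not a thousands separator.
    seconds = float(m.group(2).replace(",", "."))
    return nbytes / seconds / 1e6


for kernel, line in [
    ("3.4", "10485760000 bytes (10 GB) copied, 41,4899 s, 253 MB/s"),
    ("3.0", "10485760000 bytes (10 GB) copied, 40,861 s, 257 MB/s"),
]:
    print(kernel, round(dd_throughput_mb_s(line)), "MB/s")
```

The two kernels differ by under 2% here, which supports the point that plain direct I/O to the filesystem is not what regressed.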

> maybe some local benchmark of your ssd with 3.4 can give some tips ?

>>> How many disks (7,2K) do you have by osd ?
>>>> One intel 520 SSD per OSD.
>
> I see some benchmark on internet about 150-300MB/s (depend of the blocksize).
The OSD bench shows around 260MB/s:

ceph osd tell X bench shows me a speed of 260MB/s under both kernels, 
which matches the dd results above.

> Something must be wrong, Doing local benchmark can really help I think.
> You can use sysbench-tools
> https://github.com/tsuna/sysbench-tools
> It make bench compare with nice graphs.
Thanks, hopefully I'll find something.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-28 19:48                               ` Stefan Priebe
@ 2012-05-29  3:54                                 ` Alexandre DERUMIER
  2012-05-29  8:22                                   ` Stefan Priebe - Profihost AG
  2012-05-29  9:46                                   ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 73+ messages in thread
From: Alexandre DERUMIER @ 2012-05-29  3:54 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: ceph-devel, Mark Nelson

>> This happens with ext4 or btrfs too. 

Maybe this is related to the I/O scheduler?

Have you compared the cfq, deadline and noop schedulers?

noop should be fast with an SSD.
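
To see which scheduler a device is currently using, the sysfs file /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets. The sketch below parses that format; the helper names and the device name sda are only examples, not taken from this setup.

```python
import re
from pathlib import Path


def active_scheduler(sysfs_text):
    """Return the scheduler shown in brackets, e.g. 'cfq' for 'noop deadline [cfq]'."""
    m = re.search(r"\[([^\]]+)\]", sysfs_text)
    return m.group(1) if m else None


def read_scheduler(device):
    """Read the raw scheduler line for a block device such as 'sda' (example name)."""
    return Path(f"/sys/block/{device}/queue/scheduler").read_text()


# Parsing a sample line rather than a real device, so this runs anywhere.
print(active_scheduler("noop deadline [cfq]"))
```

Switching is done by writing the scheduler name into the same file as root, so comparing cfq, deadline and noop does not require a reboot.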


Also, what is your SAS/SATA controller?


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29  3:54                                 ` Alexandre DERUMIER
@ 2012-05-29  8:22                                   ` Stefan Priebe - Profihost AG
  2012-05-29 13:01                                     ` Alexandre DERUMIER
  2012-05-29  9:46                                   ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-29  8:22 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, Mark Nelson

On 29.05.2012 05:54, Alexandre DERUMIER wrote:
>>> This happens with ext4 or btrfs too. 
> 
> maybe this is related to io scheduler ?
> did you have compared cfq,deadline,noop scheduler ?

This is something I will consider for performance tuning later on, once
everything is running smoothly. Right now I'm using CFQ with the tuned IBM
settings (which Proxmox uses too).


Here is the output of some basic fio tests run on 3.4 and 3.0.

3.4: http://pastebin.com/raw.php?i=6GEKsCYH
3.0: http://pastebin.com/raw.php?i=FU4AtUck

Strangely, 3.4 is faster, but this matches the fact that normal
disk I/O works fine with 3.4. It's only Ceph that isn't working well.

> also what's is your sas/sata controller  ?
Intel onboard SATA controller in this test setup.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29  3:54                                 ` Alexandre DERUMIER
  2012-05-29  8:22                                   ` Stefan Priebe - Profihost AG
@ 2012-05-29  9:46                                   ` Stefan Priebe - Profihost AG
  2012-05-29 13:39                                     ` Yann Dupont
  1 sibling, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-29  9:46 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, Mark Nelson

It would be really nice if somebody from Inktank could comment on this whole
situation.

Thanks!

Stefan


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29  8:22                                   ` Stefan Priebe - Profihost AG
@ 2012-05-29 13:01                                     ` Alexandre DERUMIER
  2012-05-29 14:18                                       ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 73+ messages in thread
From: Alexandre DERUMIER @ 2012-05-29 13:01 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel, Mark Nelson

The fio benchmark will give you raw device performance, bypassing the filesystem.

So maybe the problem is in XFS or the Linux VFS layer.

I think you need to benchmark the filesystem to compare performance.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29  9:46                                   ` Stefan Priebe - Profihost AG
@ 2012-05-29 13:39                                     ` Yann Dupont
  2012-05-29 14:43                                       ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 73+ messages in thread
From: Yann Dupont @ 2012-05-29 13:39 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel

On 29/05/2012 11:46, Stefan Priebe - Profihost AG wrote:
> It would be really nice if somebody from inktank can comment this whole
> situation.
>
Hello.
I think I have the same bug:

My setup has 8 OSD nodes, 3 MDS (1 active) & 3 MON.
All my machines run Debian with a custom 3.4.0 kernel. Ceph is 
0.47.2-1~bpo60+1 (Debian package).

root@label5:~#  rados -p data bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16        99        83     331.9       332  0.059756 0.0946512
     2      16       141       125   249.946       168  0.049822  0.212338
     3      16       166       150   199.963       100  0.057352  0.257179
     4      16       227       211   210.965       244  0.043592  0.265005
     5      16       257       241   192.767       120  0.040883  0.276718
     6      16       260       244   162.641        12   1.59593  0.293439
     7      16       319       303   173.118       236  0.056913  0.357856
     8      16       348       332   165.976       116  0.052954  0.332424
     9      16       348       332   147.535         0         -  0.332424
    10      16       472       456   182.374       248  0.038543  0.343745
    11      16       485       469   170.522        52  0.040475  0.347328
    12      16       485       469   156.312         0         -  0.347328
    13      16       517       501   154.133        64  0.047759  0.378595
    14      16       562       546    155.98       180  0.042814  0.395036
    15      16       563       547   145.847         4  0.045834  0.394398
    16      16       563       547   136.732         0         -  0.394398
    17      16       563       547   128.689         0         -  0.394398
    18      16       667       651   144.648   138.667   0.06501  0.440847
    19      16       703       687   144.613       144  0.040772  0.421935
min lat: 0.030505 max lat: 5.05834 avg lat: 0.421935
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    20      16       703       687   137.382         0         -  0.421935
    21      16       704       688   131.031         2   2.65675  0.425184
    22      14       704       690   125.439         8   3.26857  0.433417
Total time run:        22.042041
Total writes made:     704
Write size:            4194304
Bandwidth (MB/sec):    127.756

Average Latency:       0.498932
Max latency:           5.05834
Min latency:           0.030505
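
The stalls in the run above (seconds 9, 12, 16, 17 and 20 report 0 MB/s) can be picked out of rados bench output mechanically. The following is a minimal sketch that assumes the eight-column per-second layout shown above; the function name is made up for illustration.

```python
def stalled_seconds(bench_output):
    """Return the seconds at which 'cur MB/s' was 0 in rados bench output."""
    stalled = []
    for line in bench_output.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # header, summary and 'min lat' lines have other widths
        try:
            sec = int(fields[0])
            cur_mb_s = float(fields[5])
        except ValueError:
            continue
        # Second 0 is always 0 MB/s because nothing has finished yet.
        if sec > 0 and cur_mb_s == 0:
            stalled.append(sec)
    return stalled


sample = """\
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     8      16       348       332   165.976       116  0.052954  0.332424
     9      16       348       332   147.535         0         -  0.332424
    10      16       472       456   182.374       248  0.038543  0.343745
    12      16       485       469   156.312         0         -  0.347328
"""
print(stalled_seconds(sample))
```

Counting stalled seconds makes it easier to compare runs across kernels or pools than the average bandwidth alone, which hides the gaps.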


What puzzles me is what happens if I test with the rbd pool instead:


root@label5:~#  rados -p rbd bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       191       175   699.782       700  0.236737 0.0841979
     2      16       397       381   761.837       824  0.065643 0.0813094
     3      16       602       586   781.193       820   0.07921 0.0808584
     4      16       815       799    798.88       852  0.066597 0.0785906
     5      16      1026      1010   807.885       844   0.10364 0.0785475
     6      16      1249      1233   821.886       892  0.069324 0.0773951
     7      16      1461      1445   825.608       848  0.053176 0.0770628
     8      16      1680      1664   831.895       876   0.09612 0.0765263
     9      16      1897      1881   835.891       868  0.100736 0.0761617
    10      16      2105      2089   835.491       832  0.114913 0.0761897
    11      16      2329      2313   840.983       896  0.042009 0.0758589
    12      16      2553      2537   845.559       896   0.07017 0.0754364
    13      16      2786      2770   852.203       932  0.066365 0.0749136
    14      16      3009      2993   855.041       892   0.06491 0.0746046
    15      16      3228      3212   856.431       876   0.05698 0.0745573
    16      16      3437      3421   855.148       836  0.062162 0.0746339
    17      16      3652      3636   855.428       860  0.140451  0.074534
    18      16      3878      3862   858.121       904  0.081505 0.0743125
    19      16      4106      4090   860.952       912  0.079922 0.0742146
min lat: 0.032342 max lat: 0.63151 avg lat: 0.0741575
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    20      16      4324      4308   861.495       872   0.06199 0.0741575
Total time run:        20.102264
Total writes made:     4325
Write size:            4194304
Bandwidth (MB/sec):    860.600

Average Latency:       0.0743131
Max latency:           0.63151
Min latency:           0.032342


As you can see, bandwidth is much more stable with this pool.

I understand that the data and rbd pools probably don't use the same internals, but 
is this difference expected?

Disclaimer: I'm by no means a Ceph expert. I'm just experimenting with 
it and still don't understand all the internals.


Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 13:01                                     ` Alexandre DERUMIER
@ 2012-05-29 14:18                                       ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-29 14:18 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, Mark Nelson

On 29.05.2012 15:01, Alexandre DERUMIER wrote:
> fio benchmark will give you raw device performance bypassing filesystem.
> 
> So maybe the problem is in xfs or linux vfs layer.
> 
> I think you need to bench the filesystem to compare performance
Here is another test with bonnie, which shows the same:
http://pastebin.com/raw.php?i=fGTt4NLi

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 13:39                                     ` Yann Dupont
@ 2012-05-29 14:43                                       ` Stefan Priebe - Profihost AG
  2012-05-29 17:50                                         ` Mark Nelson
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-29 14:43 UTC (permalink / raw)
  To: Yann Dupont; +Cc: ceph-devel

On 29.05.2012 15:39, Yann Dupont wrote:
> On 29/05/2012 11:46, Stefan Priebe - Profihost AG wrote:
>> It would be really nice if somebody from inktank can comment this whole
>> situation.
>>
> Hello.
> I think I have the same bug :
> 
> My setup is with 8 OSD nodes, 3 MDS (1 active) & 3 MON.
> All my machines are debian, using a custom 3.4.0 kernel. Ceph is
> 0.47.2-1~bpo60+1 (debian package)

That sounds like exactly the same issue. Sadly, nobody from Inktank
has replied to these problems over the last few days.

> As you can see, much more stable bandwidth with this pool.
That's pretty strange...

> I understand data & rbd pool probably don't use the same internals, but
> is this difference expected ?

There must be differences in pool handling.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 14:43                                       ` Stefan Priebe - Profihost AG
@ 2012-05-29 17:50                                         ` Mark Nelson
  2012-05-29 19:50                                           ` Yann Dupont
                                                             ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Mark Nelson @ 2012-05-29 17:50 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Yann Dupont, ceph-devel

On 05/29/2012 09:43 AM, Stefan Priebe - Profihost AG wrote:
> On 29.05.2012 15:39, Yann Dupont wrote:
>> On 29/05/2012 11:46, Stefan Priebe - Profihost AG wrote:
>>> It would be really nice if somebody from inktank can comment this whole
>>> situation.
>>>
>> Hello.
>> I think I have the same bug :
>>
>> My setup is with 8 OSD nodes, 3 MDS (1 active) & 3 MON.
>> All my machines are debian, using a custom 3.4.0 kernel. Ceph is
>> 0.47.2-1~bpo60+1 (debian package)
> That sounds absolutely like the same issue. Sadly nobody from inktank
> has replied to these problems for the last days.

Sorry about that, yesterday was a holiday in the US.

I did some quick tests on a couple of nodes I had laying around this 
morning.

Distro: Oneiric (i.e. no syncfs in glibc)
Ceph: 0.46-65-gf6c5dff

1 1GbE Client node
3 1GbE Mon nodes
2 1GbE OSD nodes with 1 OSD on each mounted on a 7200rpm SAS drive.  
btrfs with -l 64k -n64k, mounted using noatime.  H700 Raid controller 
with each drive in a 1 disk raid0.  Journals are partitioned on a 
separate drive.

/proc/version:
Linux version 3.4.0-ceph (autobuild-ceph@gitbuilder-kernel-amd64)

rados -p data bench 120 write:

Total time run:        120.601286
Total writes made:     2979
Write size:            4194304
Bandwidth (MB/sec):    98.805

Average Latency:       0.647507
Max latency:           1.39966
Min latency:           0.181663
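
The summary bandwidth appears to follow directly from the other summary fields: writes made, times write size, divided by run time, with MB meaning 2^20 bytes. The sketch below checks that against the figures above; the function name is made up, and the formula is inferred from the numbers in this thread rather than taken from the rados source.

```python
def bench_bandwidth_mb_s(writes, write_size_bytes, seconds):
    """Average bandwidth in MB/s, with 1 MB taken as 2**20 bytes (an inference)."""
    return writes * write_size_bytes / 2**20 / seconds


# Figures from the 120 second run above (reported: 98.805 MB/s).
print(round(bench_bandwidth_mb_s(2979, 4194304, 120.601286), 3))
```

Because this is an average over the whole run, it says nothing about per-second stalls; those only show up in the per-second rows.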

Once I get these nodes up to 0.47 and get them switched over to 10GbE 
I'll redo the btrfs tests and try out xfs as well with longer running tests.

>> As you can see, much more stable bandwidth with this pool.
> That's pretty strange...

Indeed, that is very strange!  Can you check to see how many pgs are in 
each?  Any difference in replication level?  You can check with:

ceph osd pool get <pool> size
ceph osd pool get <pool> pg_num

>> I understand data & rbd pool probably don't use the same internals, but
>> is this difference expected ?
> There must be differences in pool handling.
>
> Stefan

Thanks,
Mark

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 17:50                                         ` Mark Nelson
@ 2012-05-29 19:50                                           ` Yann Dupont
  2012-05-29 21:04                                           ` Stefan Priebe
  2012-05-29 21:08                                           ` Stefan Priebe
  2 siblings, 0 replies; 73+ messages in thread
From: Yann Dupont @ 2012-05-29 19:50 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Stefan Priebe - Profihost AG, ceph-devel

On 29/05/2012 19:50, Mark Nelson wrote:
>
> 1 1GbE Client node
> 3 1GbE Mon nodes
> 2 1GbE OSD nodes with 1 OSD on each mounted on a 7200rpm SAS drive.  
> btrfs with -l 64k -n64k, mounted using noatime.  H700 Raid controller 
> with each drive in a 1 disk raid0.  Journals are partitioned on a 
> separate drive.
>
Hello,
I forgot to mention that I'm using 10GbE, and the filesystem is btrfs created with -l 64k -n64k 
and mounted with space_cache,compress=lzo,nobarrier,noatime.
The journal is on tmpfs:

  osd journal = /dev/shm/journal
  osd journal size = 6144

Remember, it's not a production system for the moment. I'm just trying to 
evaluate the best performance I can get (and whether the system is 
stable enough to start alpha/pre-production services). BTW, I noticed that 
OSDs using XFS are much, much slower than OSDs with btrfs right now, 
particularly in rbd tests. btrfs has some stability problems, even if 
it seems better with newer kernels.

> /proc/version:
> Linux version 3.4.0-ceph (autobuild-ceph@gitbuilder-kernel-amd64)
>
> rados -p data bench 120 write:
>
> Total time run:        120.601286
> Total writes made:     2979
> Write size:            4194304
> Bandwidth (MB/sec):    98.805
>
> Average Latency:       0.647507
> Max latency:           1.39966
> Min latency:           0.181663
>
> Once I get these nodes up to 0.47 and get them switched over to 10GbE 
> I'll redo the btrfs tests and try out xfs as well with longer running 
> tests.
>
>>> As you can see, much more stable bandwidth with this pool.
>> That's pretty strange...
>
> Indeed, that is very strange!  Can you check to see how many pgs are 
> in each?  Any difference in replication level?  You can check with:
>
> ceph osd pool get <pool> size
root@label5:~# ceph osd pool get data size
don't know how to get pool field size
root@label5:~# ceph osd pool get rbd size
don't know how to get pool field size

Is size the right name for the field? In the wiki, size isn't listed 
as a valid field.

> ceph osd pool get <pool> pg_num
>
root@label5:~# ceph osd pool get rbd pg_num
PG_NUM: 576
root@label5:~# ceph osd pool get data pg_num
PG_NUM: 576


The PG count is quite low because I started with small OSDs (9 OSDs with 200G 
each, on internal disks) when I formatted. I have now reduced to 8 OSDs (osd.4 
is out), but with much larger (and faster) storage. 6 OSDs have 5T each; 2 
still have 200G, but they are planned to migrate before the end of the week.

For the moment, I try to keep the OSDs similar. Replication is set to 2.
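
As a rough guide to how the placement groups spread out, the average number of PG replicas per OSD for one pool can be estimated from the figures above (576 PGs, replication 2, 8 OSDs in). This is a back-of-the-envelope sketch that assumes a perfectly even distribution, which CRUSH weights and hierarchy will not give exactly; the function name is made up.

```python
def pg_replicas_per_osd(pg_num, replication, osd_count):
    """Average PG replicas per OSD for one pool, assuming an even spread."""
    return pg_num * replication / osd_count


# One pool with pg_num 576, replication 2, spread over 8 OSDs.
print(pg_replicas_per_osd(576, 2, 8))
```

Since both the data and rbd pools report the same pg_num, this estimate is identical for them and does not by itself explain the bandwidth difference.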

No OSD is full; I don't have much data stored for the moment.

Concerning the crush map, I'm not using the default one:

The 8 nodes are in 3 different locations (some kilometers apart). 2 are 
in one place, 2 in another, and the last 4 in the main location.
I try to group hosts together to avoid problems when I lose a location 
(an electrical problem, for example). I'm not sure I customized the 
crush map as I should have.

Here is the map:
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 rack
type 3 pool

# buckets
host karuizawa {
     id -5        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.2 weight 1.000
}
host hazelburn {
     id -6        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.3 weight 1.000
}
rack loire {
     id -3        # do not change unnecessarily
     # weight 2.000
     alg straw
     hash 0    # rjenkins1
     item karuizawa weight 1.000
     item hazelburn weight 1.000
}
host carsebridge {
     id -8        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.5 weight 1.000
}
host cameronbridge {
     id -9        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.6 weight 1.000
}
rack chantrerie {
     id -7        # do not change unnecessarily
     # weight 2.000
     alg straw
     hash 0    # rjenkins1
     item carsebridge weight 1.000
     item cameronbridge weight 1.000
}
host chichibu {
     id -2        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.0 weight 1.000
}
host glenesk {
     id -4        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.1 weight 1.000
}
host braeval {
     id -10        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.7 weight 1.000
}
host hanyu {
     id -11        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.8 weight 1.000
}
rack lombarderie {
     id -12        # do not change unnecessarily
     # weight 4.000
     alg straw
     hash 0    # rjenkins1
     item chichibu weight 1.000
     item glenesk weight 1.000
     item braeval weight 1.000
     item hanyu weight 1.000
}
pool default {
     id -1        # do not change unnecessarily
     # weight 8.000
     alg straw
     hash 0    # rjenkins1
     item loire weight 2.000
     item chantrerie weight 2.000
     item lombarderie weight 4.000
}

# rules
rule data {
     ruleset 0
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type host
     step emit
}
rule metadata {
     ruleset 1
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type host
     step emit
}
rule rbd {
     ruleset 2
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type host
     step emit
}

# end crush map
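
One thing worth checking in the map above: all three rules use `step chooseleaf firstn 0 type host`, which asks for replicas on distinct hosts, not on distinct racks. The sketch below is not a CRUSH emulation (it ignores straw selection and weights); it only enumerates host pairs from the buckets listed above and counts how many of them sit in the same rack, i.e. the same location. The helper name is hypothetical.

```python
from itertools import combinations

# Rack -> hosts, copied from the buckets in the map.
RACKS = {
    "loire": ["karuizawa", "hazelburn"],
    "chantrerie": ["carsebridge", "cameronbridge"],
    "lombarderie": ["chichibu", "glenesk", "braeval", "hanyu"],
}


def same_rack_pairs(racks):
    """Return (pairs sharing a rack, all distinct host pairs)."""
    host_rack = {host: rack for rack, hosts in racks.items() for host in hosts}
    pairs = list(combinations(sorted(host_rack), 2))
    shared = [p for p in pairs if host_rack[p[0]] == host_rack[p[1]]]
    return len(shared), len(pairs)


shared, total = same_rack_pairs(RACKS)
print(f"{shared} of {total} host pairs share a rack")
```

If keeping the two replicas in different locations is the goal, choosing on `type rack` in the chooseleaf step is the usual way to express that, though it is worth verifying against the CRUSH documentation for the version in use.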

Hope it helps,
cheers

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 17:50                                         ` Mark Nelson
  2012-05-29 19:50                                           ` Yann Dupont
@ 2012-05-29 21:04                                           ` Stefan Priebe
  2012-05-29 21:08                                           ` Stefan Priebe
  2 siblings, 0 replies; 73+ messages in thread
From: Stefan Priebe @ 2012-05-29 21:04 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Yann Dupont, ceph-devel

Am 29.05.2012 19:50, schrieb Mark Nelson:
> Once I get these nodes up to 0.47 and get them switched over to 10GbE
> I'll redo the btrfs tests and try out xfs as well with longer running
> tests.
I always test on 1GbE and see this problem no matter whether I use btrfs or xfs. 
So I think this is just a waste of time.

At least my tests differ in that I see this problem on ALL pools.

Mark, should I try 0.46?

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 17:50                                         ` Mark Nelson
  2012-05-29 19:50                                           ` Yann Dupont
  2012-05-29 21:04                                           ` Stefan Priebe
@ 2012-05-29 21:08                                           ` Stefan Priebe
  2012-05-29 21:31                                             ` Yann Dupont
  2012-05-29 21:41                                             ` Mark Nelson
  2 siblings, 2 replies; 73+ messages in thread
From: Stefan Priebe @ 2012-05-29 21:08 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Yann Dupont, ceph-devel

Am 29.05.2012 19:50, schrieb Mark Nelson:
> I did some quick tests on a couple of nodes I had laying around this
> morning.

I just noticed that I get a constant rate of 40MB/s when using 1 
thread. When I use two or more threads, throughput drops to 0MB/s at times 
and the values jump around wildly.

~# rados -p rbd bench 90 write -t 1
Maintaining 1 concurrent writes of 4194304 bytes for at least 90 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1       1        10         9    35.994        36  0.100147  0.101133
     2       1        20        19   37.9931        40  0.096893  0.100719
     3       1        31        30   39.9921        44   0.09784 0.0999607
     4       1        41        40   39.9929        40  0.099156 0.0999003
     5       1        51        50   39.9932        40  0.098239 0.0996518
     6       1        61        60   39.9932        40  0.098682 0.0994851
     7       1        71        70   39.9933        40  0.094397  0.099184
     8       1        81        80   39.9931        40  0.099823 0.0993327
     9       1        91        90   39.9931        40  0.101013 0.0992236
    10       1       101       100    39.993        40  0.098277  0.099237



# rados -p rbd bench 90 write -t 2
Maintaining 2 concurrent writes of 4194304 bytes for at least 90 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1       2        15        13   51.9888        52    0.0956  0.115315
     2       2        22        20   39.9928        28  0.120065  0.193125
     3       2        41        39   51.9917        76   0.09557   0.15246
     4       2        58        56   55.9912        68   0.09875  0.137688
     5       2        67        65    51.992        36  0.111211  0.139465
     6       2        85        83   55.3251        72  0.136967  0.143079
     7       2       101        99   56.5625        64  0.098664  0.136263
     8       2       101        99   49.4919         0         -  0.136263
     9       2       112       110   48.8808        22  0.099479  0.160563

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 21:08                                           ` Stefan Priebe
@ 2012-05-29 21:31                                             ` Yann Dupont
  2012-05-29 21:34                                               ` Stefan Priebe
  2012-05-29 21:41                                             ` Mark Nelson
  1 sibling, 1 reply; 73+ messages in thread
From: Yann Dupont @ 2012-05-29 21:31 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Mark Nelson, ceph-devel

Le 29/05/2012 23:08, Stefan Priebe a écrit :
> Am 29.05.2012 19:50, schrieb Mark Nelson:
>> I did some quick tests on a couple of nodes I had laying around this
>> morning.
>
> I just noticed that i get a constant rate of 40MB/s while using 1 
> thread. When i use two thread or more i get drop to 0MB/s and crazy 
> jumping values.
>
> ~# rados -p rbd bench 90 write -t 1
> Maintaining 1 concurrent writes of 4194304 bytes for at least 90 seconds.
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>     0       0         0         0         0         0         -         0
>     1       1        10         9    35.994        36  0.100147  0.101133
>     2       1        20        19   37.9931        40  0.096893  0.100719
>     3       1        31        30   39.9921        44   0.09784 0.0999607
>     4       1        41        40   39.9929        40  0.099156 0.0999003
>     5       1        51        50   39.9932        40  0.098239 0.0996518
>     6       1        61        60   39.9932        40  0.098682 0.0994851
>     7       1        71        70   39.9933        40  0.094397  0.099184
>     8       1        81        80   39.9931        40  0.099823 0.0993327
>     9       1        91        90   39.9931        40  0.101013 0.0992236
>    10       1       101       100    39.993        40  0.098277  0.099237
>
>

not here :

on data :
root@label5:~# rados -p data bench 20 write -t 1
Maintaining 1 concurrent writes of 4194304 bytes for at least 20 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1       1        15        14   55.9837        56  0.096813 0.0677311
     2       1        33        32   63.9852        72  0.088802 0.0612602
     3       1        51        50   66.6529        72  0.056883 0.0594909
     4       1        60        59    58.989        36  0.046377 0.0577145
     5       1        60        59   47.1916         0         - 0.0577145
     6       1        79        78   51.9911        38  0.041831 0.0768918
     7       1        98        97    55.419        76  0.050436 0.0718439
     8       1       101       100   49.9919        12  0.043673 0.0712079
     9       1       101       100   44.4375         0         - 0.0712079
    10       1       115       114   45.5929        28  0.043768 0.0876947
    11       1       134       133    48.356        76  0.052382 0.0826428
    12       1       154       153   50.9919        80  0.042077 0.0783619
    13       1       175       174   53.5299        84  0.053474 0.0745956
    14       1       194       193   55.1339        76  0.049631 0.0724711
    15       1       211       210    55.991        68  0.052683 0.0712887
    16       1       232       231   57.7407        84  0.044341 0.0692121
    17       1       249       248   58.3436        68  0.053707 0.0684414
    18       1       258       257    57.102        36  0.086088 0.0680656
    19       1       267       266   55.9911        36  0.050902 0.0713341
min lat: 0.033395 max lat: 2.14757 avg lat: 0.0703545
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    20       1       285       284   56.7909        72  0.047755 0.0703545
Total time run:        20.066134
Total writes made:     286
Write size:            4194304
Bandwidth (MB/sec):    57.011

on rbd :


Maintaining 1 concurrent writes of 4194304 bytes for at least 20 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       1         1         0         0         0         -         0
     1       1        18        17   67.9801        68  0.065869 0.0587313
     2       1        35        34   67.9842        68  0.056982 0.0580468
     3       1        55        54   71.9848        80  0.050305 0.0554721
     4       1        72        71   70.9858        68  0.039387 0.0561269
     5       1        91        90    71.986        76  0.055236 0.0554057
     6       1       109       108   71.9864        72  0.069547 0.0554112
     7       1       126       125   71.4154        68  0.049234 0.0556564
     8       1       146       145   72.4868        80  0.052302 0.0551064
     9       1       165       164   72.8758        76    0.0533 0.0548858
    10       1       184       183    73.187        76  0.041342 0.0543598
    11       1       202       201    73.078        72  0.048963 0.0544978
    12       1       218       217   72.3207        64  0.071926 0.0549402
    13       1       236       235   72.2951        72  0.055804 0.0551936
    14       1       254       253   72.2731        72  0.058315 0.0552612
    15       1       272       271   72.2541        72  0.047687 0.0552036
    16       1       290       289   72.2375        72  0.059162  0.055275
    17       1       308       307   72.2229        72  0.051991 0.0553467
    18       1       327       326    72.432        76  0.053271 0.0552114
    19       1       346       345   72.6192        76  0.058125 0.0550658
min lat: 0.036202 max lat: 0.113077 avg lat: 0.0547502
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    20       1       366       365   72.9874        80  0.036246 0.0547502
Total time run:        20.086555
Total writes made:     367
Write size:            4194304
Bandwidth (MB/sec):    73.084
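
To compare runs like the two above without eyeballing them, the per-second rows can be parsed and the "cur MB/s" column summarized. This is a minimal sketch that assumes the eight-column row layout shown in this thread; the function names are made up, and the sample rows are copied from the `data` pool run.

```python
import statistics

SAMPLE = """\
     3       1        51        50   66.6529        72  0.056883 0.0594909
     4       1        60        59    58.989        36  0.046377 0.0577145
     5       1        60        59   47.1916         0         - 0.0577145
     6       1        79        78   51.9911        38  0.041831 0.0768918
"""


def cur_mb_per_sec(bench_output):
    """Extract the 'cur MB/s' column from rados bench per-second rows."""
    values = []
    for line in bench_output.splitlines():
        fields = line.split()
        # Data rows have 8 columns and start with the elapsed second.
        # Second 0 is skipped because nothing has completed yet.
        if len(fields) == 8 and fields[0].isdigit() and int(fields[0]) > 0:
            values.append(float(fields[5]))
    return values


def summarize(values):
    """Count fully stalled seconds and describe the spread."""
    return {
        "stalled_seconds": sum(1 for v in values if v == 0),
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
    }


print(summarize(cur_mb_per_sec(SAMPLE)))
```

A high standard deviation together with a non-zero count of stalled seconds is what "jumping up & down" looks like in numbers.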

>
> # rados -p rbd bench 90 write -t 2
> Maintaining 2 concurrent writes of 4194304 bytes for at least 90 seconds.
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>     0       0         0         0         0         0         -         0
>     1       2        15        13   51.9888        52    0.0956  0.115315
>     2       2        22        20   39.9928        28  0.120065  0.193125
>     3       2        41        39   51.9917        76   0.09557   0.15246
>     4       2        58        56   55.9912        68   0.09875  0.137688
>     5       2        67        65    51.992        36  0.111211  0.139465
>     6       2        85        83   55.3251        72  0.136967  0.143079
>     7       2       101        99   56.5625        64  0.098664  0.136263
>     8       2       101        99   49.4919         0         -  0.136263
>     9       2       112       110   48.8808        22  0.099479  0.160563
>
> Stefan

Pool rbd stays consistent here, no matter how many threads are involved. 
My setup reaches its maximum speed at around 16~24 threads, and it's quite effective.

By contrast, pool data jumps up & down, no matter how many 
threads are involved :)

Maybe this is because the journal is too small? Or because 2 of the 8 nodes 
have slower disks?

I may be able to retest on Thursday; by then my last two OSDs should have faster & 
larger disks.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 21:31                                             ` Yann Dupont
@ 2012-05-29 21:34                                               ` Stefan Priebe
  2012-05-29 21:45                                                 ` Yann Dupont
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe @ 2012-05-29 21:34 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Mark Nelson, ceph-devel

Am 29.05.2012 23:31, schrieb Yann Dupont:
> on the contrary, pool data is jumping up & down, no matter how much
> thread involved :)
>
> Maybe this is because journal is too tight ? Or because 2 of the 8 nodes
> have slower disks ?
Can you try with 3.0.X? I would be really interested what happens in 
this case.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 21:08                                           ` Stefan Priebe
  2012-05-29 21:31                                             ` Yann Dupont
@ 2012-05-29 21:41                                             ` Mark Nelson
  2012-05-30  6:22                                               ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 73+ messages in thread
From: Mark Nelson @ 2012-05-29 21:41 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Yann Dupont, ceph-devel

On 05/29/2012 04:08 PM, Stefan Priebe wrote:
> Am 29.05.2012 19:50, schrieb Mark Nelson:
>> I did some quick tests on a couple of nodes I had laying around this
>> morning.
>
> I just noticed that i get a constant rate of 40MB/s while using 1 
> thread. When i use two thread or more i get drop to 0MB/s and crazy 
> jumping values.
>
> ~# rados -p rbd bench 90 write -t 1
> Maintaining 1 concurrent writes of 4194304 bytes for at least 90 seconds.
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>     0       0         0         0         0         0         -         0
>     1       1        10         9    35.994        36  0.100147  0.101133
>     2       1        20        19   37.9931        40  0.096893  0.100719
>     3       1        31        30   39.9921        44   0.09784 0.0999607
>     4       1        41        40   39.9929        40  0.099156 0.0999003
>     5       1        51        50   39.9932        40  0.098239 0.0996518
>     6       1        61        60   39.9932        40  0.098682 0.0994851
>     7       1        71        70   39.9933        40  0.094397  0.099184
>     8       1        81        80   39.9931        40  0.099823 0.0993327
>     9       1        91        90   39.9931        40  0.101013 0.0992236
>    10       1       101       100    39.993        40  0.098277  0.099237
>
>

When you are using 1 thread, you are hitting a ~40MB/s limit (probably 
networking related) before the data gets to the journal.  Because (in 
this case) the filestore data disk can handle that throughput, 
everything looks nice and consistent.

>
> # rados -p rbd bench 90 write -t 2
> Maintaining 2 concurrent writes of 4194304 bytes for at least 90 seconds.
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>     0       0         0         0         0         0         -         0
>     1       2        15        13   51.9888        52    0.0956  0.115315
>     2       2        22        20   39.9928        28  0.120065  0.193125
>     3       2        41        39   51.9917        76   0.09557   0.15246
>     4       2        58        56   55.9912        68   0.09875  0.137688
>     5       2        67        65    51.992        36  0.111211  0.139465
>     6       2        85        83   55.3251        72  0.136967  0.143079
>     7       2       101        99   56.5625        64  0.098664  0.136263
>     8       2       101        99   49.4919         0         -  0.136263
>     9       2       112       110   48.8808        22  0.099479  0.160563
>

In this case, that 40MB/s limit with 1 thread has increased.  Now more 
data is getting fed into the journal than the filestore can write out to 
disk.  Eventually writes stall while the data is being written out.
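
The explanation above can be illustrated with a toy model: treat the journal as a fixed-size buffer that clients fill and the filestore drains. All numbers below are invented for illustration and are not measurements from this thread; the model is a sketch of the idea, not of how the OSD is actually implemented.

```python
def simulate(ingest_mb_s, drain_mb_s, journal_mb, seconds):
    """Return the MB accepted from clients in each one-second step."""
    used = 0.0
    accepted = []
    for _ in range(seconds):
        # The filestore first drains what it can from the journal.
        used -= min(used, drain_mb_s)
        # Clients can only write into whatever journal space is free.
        took = min(ingest_mb_s, journal_mb - used)
        used += took
        accepted.append(took)
    return accepted


# Ingest below the drain rate: the journal never fills, throughput is flat.
print(simulate(40, 50, 100, 10))
# Ingest above the drain rate: the journal fills, then clients are held back.
print(simulate(60, 50, 100, 10))
```

The real system is burstier than this: as described above, writes can stall outright while data is being flushed, which this smooth model does not capture.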
> Stefan


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 21:34                                               ` Stefan Priebe
@ 2012-05-29 21:45                                                 ` Yann Dupont
  2012-05-30  6:29                                                   ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 73+ messages in thread
From: Yann Dupont @ 2012-05-29 21:45 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Mark Nelson, ceph-devel

Le 29/05/2012 23:34, Stefan Priebe a écrit :
> Am 29.05.2012 23:31, schrieb Yann Dupont:
>> on the contrary, pool data is jumping up & down, no matter how much
>> thread involved :)
>>
>> Maybe this is because journal is too tight ? Or because 2 of the 8 nodes
>> have slower disks ?
> Can you try with 3.0.X? I would be really interested what happens in 
> this case.
>
> Stefan
Hmm...
probably not directly. Older btrfs won't like big metadata, I think. 
This is quite a recent feature.

But as my ceph is not in production, I can stop it, use an older kernel, 
format new volumes in btrfs or xfs, or whatever, and try.

It will be a totally fresh install then.

I can do that on Thursday.

Stefan, Mark,
I'll take the latest 3.0 kernel, or do you have a particular 3.0 kernel 
version to test?
And do you want a particular xfs/btrfs format?

cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-24 14:10 poor OSD performance using kernel 3.4 Stefan Priebe - Profihost AG
  2012-05-24 14:57 ` Mark Nelson
       [not found] ` <CAJCPpW+SKnnVUaDEAsCkKyZwMVrHCRJF2C8zqB4eORgwW5p=1Q@mail.gmail.com>
@ 2012-05-29 22:25 ` Mark Nelson
  2012-05-30  6:33   ` Stefan Priebe - Profihost AG
  2 siblings, 1 reply; 73+ messages in thread
From: Mark Nelson @ 2012-05-29 22:25 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel@vger.kernel.org

On 05/24/2012 09:10 AM, Stefan Priebe - Profihost AG wrote:
> Hi list,
>
> today while testing btrfs i discovered a very poor osd performance using
> kernel 3.4.
>
> Underlying FS is XFS but it is the same with btrfs.
>
> 3.0.30:
> ~# rados -p data bench 10 write -t 16
> Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>      0       0         0         0         0         0         -         0
>      1      16        41        25   99.9767       100  0.586984  0.447293
>      2      16        71        55   109.979       120  0.934388  0.488375
>      3      16        99        83   110.647       112   1.15982  0.503111
>      4      16       130       114   113.981       124   1.05952  0.516925
>      5      16       159       143   114.382       116  0.149313  0.510734
>      6      16       188       172   114.649       116  0.287166   0.52203
>      7      16       215       199   113.697       108  0.151784  0.531461
>      8      16       242       226   112.984       108  0.623478  0.539896
>      9      16       265       249   110.651        92   0.50354  0.538504
>     10      16       296       280   111.984       124  0.155048  0.542846
> Total time run:        10.776153
> Total writes made:     297
> Write size:            4194304
> Bandwidth (MB/sec):    110.243
>
> Average Latency:       0.577534
> Max latency:           1.85499
> Min latency:           0.091473
>
>
> 3.4:
> ~# rados -p data bench 10 write -t 16
> Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>      0       0         0         0         0         0         -         0
>      1      16        40        24   95.9794        96  0.393196  0.455936
>      2      16        68        52   103.983       112  0.835652  0.517297
>      3      16        85        69   91.9849        68   1.00535  0.493058
>      4      16        96        80   79.9869        44  0.096564  0.577948
>      5      16       103        87   69.5879        28  0.092722  0.589147
>      6      16       117       101   67.3216        56  0.222175  0.675334
>      7      16       130       114   65.1321        52   0.15677  0.623806
>      8      16       144       128   63.9896        56  0.089157   0.56746
>      9      16       144       128   56.8794         0         -   0.56746
>     10      16       144       128   51.1912         0         -   0.56746
>     11      16       144       128   46.5373         0         -   0.56746
>     12      16       144       128   42.6591         0         -   0.56746
>     13      16       144       128   39.3776         0         -   0.56746
>     14      16       144       128   36.5649         0         -   0.56746
>     15      16       144       128   34.1272         0         -   0.56746
>     16      16       145       129   32.2443       0.5   11.3422  0.650985
> Total time run:        16.193871
> Total writes made:     145
> Write size:            4194304
> Bandwidth (MB/sec):    35.816
>
> Average Latency:       1.78467
> Max latency:           14.4744
> Min latency:           0.088753
>
> Stefan

I set up some tests today to try to replicate your findings (and also 
check results against some previous ones I've done).  I don't think I'm 
seeing exactly the same results as you, but I definitely see xfs 
performing worse in this specific test than btrfs.  I've included the 
results here.

Distro: Ubuntu Oneiric (i.e. no syncfs in glibc)
Ceph: 0.47.2
Kernel 3.4.0-ceph (autobuild-ceph@gitbuilder-kernel-amd64)
Network: 10GbE

1 Client node
3 Mon nodes
2 OSD nodes with 1 OSD each mounted on a 7200rpm SAS drive.  H700 Raid 
controller with each drive in a 1 disk raid0.  Journals are partitioned 
on a separate drive.  OSD data disks are using WT cache while journals 
are using WB.
btrfs created with -l 64k -n64k, mounted using noatime.
xfs created with -f -d su=64k,sw=1 -i size=2048, mounted using noatime.
rados bench invocation: rados -p data bench 300 write -t 16 -b 4194304

btrfs:

Total time run:        300.413696
Total writes made:     7582
Write size:            4194304
Bandwidth (MB/sec):    100.954

Average Latency:       0.633932
Max latency:           3.78661
Min latency:           0.065734

xfs:

Total time run:        304.435966
Total writes made:     5023
Write size:            4194304
Bandwidth (MB/sec):    65.997

Average Latency:       0.96965
Max latency:           36.4993
Min latency:           0.07516
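
As a sanity check on the summaries above, the reported bandwidth appears to be total bytes written divided by run time, expressed in units of 2^20 bytes. That is inferred from the numbers in this thread rather than from the rados source, and the function name is made up.

```python
MIB = 1024 * 1024


def bench_bandwidth(writes, write_size_bytes, seconds):
    """Bandwidth in MiB/s from the three summary fields of a bench run."""
    return writes * write_size_bytes / seconds / MIB


# btrfs run: 7582 writes of 4194304 bytes in 300.413696 s
print(round(bench_bandwidth(7582, 4194304, 300.413696), 3))
# xfs run: 5023 writes of 4194304 bytes in 304.435966 s
print(round(bench_bandwidth(5023, 4194304, 304.435966), 3))
```

Both values agree with the "Bandwidth (MB/sec)" lines to three decimals, so the "MB" in the output is really MiB.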

Full results are available here:

http://nhm.ceph.com/results/mailinglist-tests/

I created seekwatcher movies by running blktrace on the underlying OSD 
data disks during the tests.  These show throughput over time, 
seeks/sec, and visual representation of where the disk is being written 
to for each OSD.  You can see them here:

http://nhm.ceph.com/movies/mailinglist-tests/

As you can see, at least for the quick tests I did this afternoon, the 
performance of the underlying OSD disk is highly correlated with the 
number of seeks being done.  These results may improve with syncfs 
support in Ubuntu 12.04.  If you have your journals on the same disks as 
the OSDs, that will cause even more seeks (in addition to the greater 
throughput demands).  These are things that we are 
actively investigating and hopefully will be able to improve over the 
coming months.

Thanks,
Mark


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 21:41                                             ` Mark Nelson
@ 2012-05-30  6:22                                               ` Stefan Priebe - Profihost AG
  2012-05-30  7:20                                                 ` building test cluster : missing /etc/ceph/client.admin.keyring, need help Alexandre DERUMIER
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-30  6:22 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Yann Dupont, ceph-devel

Am 29.05.2012 23:41, schrieb Mark Nelson:

> When you are using 1 thread, you are hitting a ~40MB/s limit (probably
> networking related) before the data gets to the journal.
The network (1Gbit/s links) is capable of at least 130MB/s: I get 130MB/s with 3.0.30 using
16 threads. I don't see why I should hit a limit here.

> Because (in
> this case) the filestore data disk can handle that throughput,
> everything looks nice and consistent.
osd bench, fio and dd all tell me the underlying disks can handle
260MB/s (Intel SSD).

> In this case, that 40MB/s limit with 1 thread has increased.  Now more
> data is getting fed into the journal than the filestore can write out to
> disk.  Eventually writes stall while the data is being written out.

I don't want to argue, but why should this happen only with 3.4.0 and NOT
with 3.0.30? It also does not matter which underlying FS I use:
it is the same with XFS AND btrfs.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 21:45                                                 ` Yann Dupont
@ 2012-05-30  6:29                                                   ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-30  6:29 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Mark Nelson, ceph-devel

Am 29.05.2012 23:45, schrieb Yann Dupont:
> Le 29/05/2012 23:34, Stefan Priebe a écrit :
>> Am 29.05.2012 23:31, schrieb Yann Dupont:
>>> on the contrary, pool data is jumping up & down, no matter how much
>>> thread involved :)
>>>
>>> Maybe this is because journal is too tight ? Or because 2 of the 8 nodes
>>> have slower disks ?
>> Can you try with 3.0.X? I would be really interested what happens in
>> this case.
>>
>> Stefan
> hum...
> probably not directly. Older btrfs won't like big metadata, I think.
> This is quite a recent feature.
That's absolutely correct. If you test 3.0.X, I think it's better to use
XFS. I'm just interested in whether the problem we both see is gone for you too
with 3.0.X.

> I'll take the latest 3.0 kernel - or do you have a particular 3.0 kernel
> version to test ?
I've used the latest 3.0.X stable (.32 right now)

> And Do you want a particular xfs/btrfs format ?
mkfs.xfs is enough ;-)

Thanks!

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-29 22:25 ` poor OSD performance using kernel 3.4 Mark Nelson
@ 2012-05-30  6:33   ` Stefan Priebe - Profihost AG
       [not found]     ` <CADdPHGs9dpSh9Oyu+5yDhyYU=Et_-zF5MuYybBuuAN5DgR433A@mail.gmail.com>
  2012-05-30 11:51     ` poor OSD performance using kernel 3.4 Mark Nelson
  0 siblings, 2 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-30  6:33 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel@vger.kernel.org

> 
> I setup some tests today to try to replicate your findings (and also
> check results against some previous ones I've done).  I don't think I'm
> seeing exactly the same results as you, but I definitely see xfs
> performing worse in this specific test than btrfs.  I've included the
> results here.
>
> Full results are available here:
> http://nhm.ceph.com/results/mailinglist-tests/

But these tests show exactly the same bad behaviour I'm seeing. Instead
of a constant sequential write rate, you have values that jump heavily.
Are you able to test with XFS and 3.0.32? You'll then probably
see an absolutely constant write rate.

Greets,
Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
       [not found]     ` <CADdPHGs9dpSh9Oyu+5yDhyYU=Et_-zF5MuYybBuuAN5DgR433A@mail.gmail.com>
@ 2012-05-30  7:16       ` Stefan Priebe - Profihost AG
       [not found]         ` <CADdPHGuiJqZUCK-0qR_CrOo6GRhkjaCdkOhJ2boq3zD0_voTsA@mail.gmail.com>
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-30  7:16 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

Am 30.05.2012 09:01, schrieb Stefan Majer:
> Hi Stefan,
> 
> what is your replication factor? If it is set to 2 and your OSDs have a
> single 1Gbit/sec link, you will never see more than 120MB/sec, and I suspect
> much less, because every write has to cross the same wire twice from
> each OSD.
Sure - but right now i see 10MB/s with kernel 3.4 and 170MB/s with
3.0.30 using bonded 2x 1Gbit/s links.
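
For reference, the link speeds discussed here convert to raw byte rates as follows. This is a unit check only: it ignores Ethernet/TCP overhead and replication traffic, and bonding does not necessarily double the rate of a single flow. The function name is illustrative.

```python
def line_rate_mb_s(gbit_per_s, links=1):
    """Raw line rate in decimal MB/s, ignoring protocol overhead."""
    return gbit_per_s * links * 1000 / 8


print(line_rate_mb_s(1))     # one 1 Gbit/s link
print(line_rate_mb_s(1, 2))  # two bonded 1 Gbit/s links
```

A figure of 170MB/s is therefore only possible with both bonded links in use, and 10MB/s is far below what either configuration allows.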

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* building test cluster : missing /etc/ceph/client.admin.keyring, need help
  2012-05-30  6:22                                               ` Stefan Priebe - Profihost AG
@ 2012-05-30  7:20                                                 ` Alexandre DERUMIER
  2012-05-30  7:25                                                   ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 73+ messages in thread
From: Alexandre DERUMIER @ 2012-05-30  7:20 UTC (permalink / raw)
  To: ceph-devel


Hi,
I'm building my rados test cluster:

3 servers, each running 1 mon and 5 OSDs.

The mon and OSD daemons are started, but when I run the ceph command, it reports that client.admin.keyring is missing:

root@cephtest1:/etc/ceph# ceph -w
2012-05-30 09:05:35.255619 7fd1e9cfa760 -1 auth: failed to open keyring from /etc/ceph/client.admin.keyring
2012-05-30 09:05:35.255631 7fd1e9cfa760 -1 monclient(hunting): failed to open keyring: (2) No such file or directory
2012-05-30 09:05:35.255693 7fd1e9cfa760 -1 ceph_tool_common_init failed.


root@cephtest1:/etc/ceph# ls /etc/ceph/
ceph.conf  osd.0.keyring  osd.1.keyring  osd.2.keyring  osd.3.keyring  osd.4.keyring

Do I need to generate a keyring? How can I do it?

/etc/ceph/ceph.conf:


[global] 
; use cephx or none 
auth supported = cephx 
keyring = /etc/ceph/$name.keyring 


[mon] 
mon data = /srv/mon.$id 


[mds] 


[osd] 
osd data = /srv/osd.$id 
osd journal = /srv/osd.$id.journal 
osd journal size = 1000 
; uncomment the following line if you are mounting with ext4 
; filestore xattr use omap = true 


[mon.a] 
host = cephtest1 
mon addr = 10.3.94.27:6789 


[mon.b] 
host = cephtest2 
mon addr = 10.3.94.28:6789 


[mon.c] 
host = cephtest3 
mon addr = 10.3.94.29:6789 


[osd.0] 
host = cephtest1 
addr = 10.3.94.27 


[osd.1] 
host = cephtest1 
addr = 10.3.94.27 


[osd.2] 
host = cephtest1 
addr = 10.3.94.27 


[osd.3] 
host = cephtest1 
addr = 10.3.94.27 


[osd.4] 
host = cephtest1 
addr = 10.3.94.27 


[osd.5] 
host = cephtest2 
addr = 10.3.94.28 


[osd.6] 
host = cephtest2 
addr = 10.3.94.28 


[osd.7] 
host = cephtest2 
addr = 10.3.94.28 


[osd.8] 
host = cephtest2 
addr = 10.3.94.28 


[osd.9] 
host = cephtest2 
addr = 10.3.94.28 


[osd.10] 
host = cephtest3 
addr = 10.3.94.29 

[osd.11] 
host = cephtest3 
addr = 10.3.94.29 


[osd.12] 
host = cephtest3 
addr = 10.3.94.29 


[osd.13] 
host = cephtest3 
addr = 10.3.94.29 


[osd.14] 
host = cephtest3 
addr = 10.3.94.29 


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: building test cluster : missing /etc/ceph/client.admin.keyring, need help
  2012-05-30  7:20                                                 ` building test cluster : missing /etc/ceph/client.admin.keyring, need help Alexandre DERUMIER
@ 2012-05-30  7:25                                                   ` Stefan Priebe - Profihost AG
  2012-05-30  7:33                                                     ` Alexandre DERUMIER
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-30  7:25 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel

On 30.05.2012 09:20, Alexandre DERUMIER wrote:
> 
> Hi, 
> I'm building my rados test cluster, 
> 
> 
> 3 servers,with on each server : 1 mon - 5 osd
> 
> mon daemon and osd are started, but when i use ceph command, it's missing client.admin.keyring
> 
> root@cephtest1:/etc/ceph# ceph -w
> 2012-05-30 09:05:35.255619 7fd1e9cfa760 -1 auth: failed to open keyring from /etc/ceph/client.admin.keyring
> 2012-05-30 09:05:35.255631 7fd1e9cfa760 -1 monclient(hunting): failed to open keyring: (2) No such file or directory
> 2012-05-30 09:05:35.255693 7fd1e9cfa760 -1 ceph_tool_common_init failed.
Just run:
mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/client.admin.keyring

and it will create the admin key for you.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: building test cluster : missing /etc/ceph/client.admin.keyring, need help
  2012-05-30  7:25                                                   ` Stefan Priebe - Profihost AG
@ 2012-05-30  7:33                                                     ` Alexandre DERUMIER
  2012-05-30  7:47                                                       ` Alexandre DERUMIER
  0 siblings, 1 reply; 73+ messages in thread
From: Alexandre DERUMIER @ 2012-05-30  7:33 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel

OK, thanks.

I had created the cluster following the official doc
http://ceph.com/docs/master/config-cluster/deploying-ceph-with-mkcephfs/
with

mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring

and the file was created in /srv:

# cat /srv/ceph.keyring 
[client.admin]
        key = AQCQwcVPGIAwHhAAuS5Veg7GoOyzh59zq2TKag==


Is this an error in the documentation?



-- 
Alexandre Derumier
Systems Engineer
Phone: 03 20 68 88 90
Fax: 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: building test cluster : missing /etc/ceph/client.admin.keyring, need help
  2012-05-30  7:33                                                     ` Alexandre DERUMIER
@ 2012-05-30  7:47                                                       ` Alexandre DERUMIER
  0 siblings, 0 replies; 73+ messages in thread
From: Alexandre DERUMIER @ 2012-05-30  7:47 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel

root@cephtest1:/srv# cp /srv/ceph.keyring /etc/ceph/client.admin.keyring
root@cephtest1:/srv# ceph -w
2012-05-30 09:26:40.336175    pg v572: 2880 pgs: 2880 active+clean; 0 bytes data, 544 MB used, 2039 GB / 2039 GB avail
2012-05-30 09:26:40.342175   mds e1: 0/0/1 up
2012-05-30 09:26:40.342207   osd e17: 15 osds: 15 up, 15 in
2012-05-30 09:26:40.342331   log 2012-05-30 09:06:35.419340 osd.9 10.3.94.28:6812/13794 260 : [INF] 2.3bb scrub ok
2012-05-30 09:26:40.342424   mon e1: 3 mons at {a=10.3.94.27:6789/0,b=10.3.94.28:6789/0,c=10.3.94.29:6789/0}

Ok, the fun will begin now :)



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
       [not found]         ` <CADdPHGuiJqZUCK-0qR_CrOo6GRhkjaCdkOhJ2boq3zD0_voTsA@mail.gmail.com>
@ 2012-05-30 11:04           ` Stefan Priebe - Profihost AG
       [not found]             ` <CADdPHGuLAL5+hkzq0tigqu355DvPxkhE5sxBhOVZPj=EzDSVtA@mail.gmail.com>
  2012-05-30 12:17             ` Mark Nelson
  0 siblings, 2 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-30 11:04 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

On 30.05.2012 09:19, Stefan Majer wrote:
> Hi,
> 
> OK, so your replication level is 2 and you have 2 x 1 Gbit/s, right?
Generally yes - but for this new test it was just 1 x 1 Gbit/s (see below).

> Do you have iostat -x 3 output and/or dstat output from all affected
> machines during your rados bench runs as well?

As the output looks exactly the same on all OSDs, here it is from ONE OSD:

Kernel 3.4:
http://pastebin.com/raw.php?i=sV9vKsWy

Kernel 3.0:
http://pastebin.com/raw.php?i=eafjpPpK

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
       [not found]             ` <CADdPHGuLAL5+hkzq0tigqu355DvPxkhE5sxBhOVZPj=EzDSVtA@mail.gmail.com>
@ 2012-05-30 11:25               ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-30 11:25 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

On 30.05.2012 13:20, Stefan Majer wrote:
> Hi,
> 
> 
> On Wed, May 30, 2012 at 1:04 PM, Stefan Priebe - Profihost AG
> <s.priebe@profihost.ag> wrote:
> 
>     On 30.05.2012 09:19, Stefan Majer wrote:
>     > Hi,
>     >
>     > OK, so your replication level is 2 and you have 2 x 1 Gbit/s, right?
>     Generally yes - but for this new test it was just 1 x 1 Gbit/s (see below).
> 
>     > Do you have iostat -x 3 output and/or dstat output from all affected
>     > machines during your rados bench runs as well?
> 
>     As the output looks exactly the same on all OSDs, here it is from ONE
>     OSD:
> 
>     Kernel 3.4:
>     http://pastebin.com/raw.php?i=sV9vKsWy
> 
> This is strange; it looks like a real regression in 3.4. But I guess the
> only way to track it down is to run
> git bisect on the kernel sources :-(
I also tried 3.3 and 3.2; it's the same (I haven't tested 3.1).

>     Kernel 3.0:
>     http://pastebin.com/raw.php?i=eafjpPpK
> 
> 
> Here you can see a constant rate to disk of ~50-70 MB/s with about
> 10-15% utilization on them. So this test is not disk bound; I guess
> your network is saturated. Can you run dstat during this test as well
> to see the network bandwidth used?
Absolutely correct. I'm aware of this. I just want to have this result
with 3.4 so that I can use btrfs.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-30  6:33   ` Stefan Priebe - Profihost AG
       [not found]     ` <CADdPHGs9dpSh9Oyu+5yDhyYU=Et_-zF5MuYybBuuAN5DgR433A@mail.gmail.com>
@ 2012-05-30 11:51     ` Mark Nelson
  1 sibling, 0 replies; 73+ messages in thread
From: Mark Nelson @ 2012-05-30 11:51 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel@vger.kernel.org

On 05/30/2012 01:33 AM, Stefan Priebe - Profihost AG wrote:
>> I set up some tests today to try to replicate your findings (and also
>> check results against some previous ones I've done).  I don't think I'm
>> seeing exactly the same results as you, but I definitely see xfs
>> performing worse in this specific test than btrfs.  I've included the
>> results here.
>>
>> Full results are available here:
>> http://nhm.ceph.com/results/mailinglist-tests/
> But these tests show exactly the same bad behaviour I'm seeing. Instead
> of a constant sequential write rate, you have heavily fluctuating
> values. Are you able to test with XFS and 3.0.32? You'll then probably
> see an absolutely constant write rate.
>
> Greets,
> Stefan

The jumping around is due to the writes to the underlying OSD disk not 
being able to keep up with the journal.  I think it's more a symptom of 
the problem rather than the problem itself.  Presumably the OSD data 
disk is performing slowly because of the number of seeks that are 
happening (In my tests almost always between 40-60 on XFS, and growing 
over time on btrfs).  It's entirely possible that something changed 
going from 3.0 to 3.4 that is causing the seek behavior to be worse.  
I'll try the test again on a 3.0 kernel and record seekwatcher results 
to see if the write patterns look any different.

Btw, I apologize if you mentioned this already, but are you running MONs 
on the OSD nodes?  Also, what version of glibc do you have?

Thanks,
Mark

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-30 11:04           ` Stefan Priebe - Profihost AG
       [not found]             ` <CADdPHGuLAL5+hkzq0tigqu355DvPxkhE5sxBhOVZPj=EzDSVtA@mail.gmail.com>
@ 2012-05-30 12:17             ` Mark Nelson
  2012-05-30 12:41               ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 73+ messages in thread
From: Mark Nelson @ 2012-05-30 12:17 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Stefan Majer, ceph-devel@vger.kernel.org

On 05/30/2012 06:04 AM, Stefan Priebe - Profihost AG wrote:
> On 30.05.2012 09:19, Stefan Majer wrote:
>> Hi,
>>
>> OK, so your replication level is 2 and you have 2 x 1 Gbit/s, right?
> Generally yes - but for this new test it was just 1 x 1 Gbit/s (see below).
>
>> Do you have iostat -x 3 output and/or dstat output from all affected
>> machines during your rados bench runs as well?
> As the output looks exactly the same on all OSDs, here it is from ONE OSD:
>
> Kernel 3.4:
> http://pastebin.com/raw.php?i=sV9vKsWy
>
> Kernel 3.0:
> http://pastebin.com/raw.php?i=eafjpPpK
>
> Stefan

Would you mind installing blktrace and running "blktrace -o test-3.4 -d 
/dev/sdb" on the OSD node during a short (say 60s) test on 3.4?

If you could archive/send me the results, that might help us get an idea 
of what is actually getting sent out to the disk.  Your data disk 
throughput on 3.0 looks pretty close to what I normally get (including 
on 3.4).  I'm guessing the issue you are seeing on 3.4 is probably not 
the seek problem I mentioned earlier (unless something is causing so 
many seeks that it more or less paralyzes the disk).

Mark

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-30 12:17             ` Mark Nelson
@ 2012-05-30 12:41               ` Stefan Priebe - Profihost AG
       [not found]                 ` <CADdPHGsmr8Ht1pTWH1Oe8=NmAyM81SSdH+c_GV89D8ntfyUmgA@mail.gmail.com>
                                   ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-30 12:41 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Stefan Majer, ceph-devel@vger.kernel.org

Hi Mark,

I haven't had time to answer your mails yet, but I will deal with this one first.

> Would you mind installing blktrace and running "blktrace -o test-3.4 -d
> /dev/sdb" on the OSD node during a short (say 60s) test on 3.4?
Sure, no problem.

Here it is:
http://www.mediafire.com/?6cw87btn7mzco25

Output:
=== sdb ===
  CPU  0:                18075 events,      848 KiB data
  CPU  1:                10738 events,      504 KiB data
  CPU  2:                 8639 events,      405 KiB data
  CPU  3:                 8614 events,      404 KiB data
  CPU  4:                    0 events,        0 KiB data
  CPU  5:                    0 events,        0 KiB data
  CPU  6:                  143 events,        7 KiB data
  CPU  7:                    0 events,        0 KiB data
  Total:                 46209 events (dropped 0),     2167 KiB data

> If you could archive/send me the results, that might help us get an idea
> of what is actually getting sent out to the disk.  Your data disk
> throughput on 3.0 looks pretty close to what I normally get (including
> on 3.4).  I'm guessing the issue you are seeing on 3.4 is probably not
> the seek problem I mentioned earlier (unless something is causing so
> many seeks that it more or less paralyzes the disk).
As I have an SSD, I can't believe seeks can be a problem.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
       [not found]                 ` <CADdPHGsmr8Ht1pTWH1Oe8=NmAyM81SSdH+c_GV89D8ntfyUmgA@mail.gmail.com>
@ 2012-05-30 13:19                   ` Stefan Priebe - Profihost AG
       [not found]                     ` <CADdPHGvxCmuViy+0==Vkdz_QjC1K+kD5kD1m7+0tYM2YDTtJbw@mail.gmail.com>
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-30 13:19 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

On 30.05.2012 15:11, Stefan Majer wrote:
> Hi,
> 
> I don't think seeks are a problem, because Stefan would see a huge disk
> utilization percentage in iostat, which is not the case.
> I guess the problem with 3.2 and later is somewhere else, for example
> in a network card driver that changed dramatically, or something like that.

I can't believe that the e1000 driver is now that buggy, especially as iperf
still shows me around 950 Mbit/s no matter whether I run 3.0, 3.2, ...

> Is it possible to do a git bisect on that machine and do some runs?
> Otherwise I see no way to identify this.
I'm not familiar with git bisect, so I can't answer this question.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-30 12:41               ` Stefan Priebe - Profihost AG
       [not found]                 ` <CADdPHGsmr8Ht1pTWH1Oe8=NmAyM81SSdH+c_GV89D8ntfyUmgA@mail.gmail.com>
@ 2012-05-30 13:27                 ` Mark Nelson
  2012-05-30 13:51                   ` Stefan Priebe - Profihost AG
  2012-05-30 14:16                 ` Mark Nelson
  2 siblings, 1 reply; 73+ messages in thread
From: Mark Nelson @ 2012-05-30 13:27 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Stefan Majer, ceph-devel@vger.kernel.org

On 5/30/12 7:41 AM, Stefan Priebe - Profihost AG wrote:
> Hi Mark,
>
> didn't had the time to answer your mails - but i will get on this one first.
>
>> Would you mind installing blktrace and running "blktrace -o test-3.4 -d
>> /dev/sdb" on the OSD node during a short (say 60s) test on 3.4?
> sure no problem.
>
> here it is:
> http://www.mediafire.com/?6cw87btn7mzco25
>
> Output:
> === sdb ===
>    CPU  0:                18075 events,      848 KiB data
>    CPU  1:                10738 events,      504 KiB data
>    CPU  2:                 8639 events,      405 KiB data
>    CPU  3:                 8614 events,      404 KiB data
>    CPU  4:                    0 events,        0 KiB data
>    CPU  5:                    0 events,        0 KiB data
>    CPU  6:                  143 events,        7 KiB data
>    CPU  7:                    0 events,        0 KiB data
>    Total:                 46209 events (dropped 0),     2167 KiB data

Great, thanks.  I'll try to look at the results later this morning.  If 
you want to look at them yourself you can open them with the blkparse 
program (and seekwatcher too, though there is a bug in the src you have 
to fix to make it work right)

>> If you could archive/send me the results, that might help us get an idea
>> of what is actually getting sent out to the disk.  Your data disk
>> throughput on 3.0 looks pretty close to what I normally get (including
>> on 3.4).  I'm guessing the issue you are seeing on 3.4 is probably not
>> the seek problem I mentioned earlier (unless something is causing so
>> many seeks that it more or less paralyzes the disk).
> As i have a SSD i can't believe seeks can be a problem.

Ah, sorry. I forgot you were on SSD.  Honestly I'm surprised that with 
3.0 you weren't getting better performance.  Something to look into once 
we figure out why your 3.4 performance is so bad!
> Stefan


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-30 13:27                 ` Mark Nelson
@ 2012-05-30 13:51                   ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-30 13:51 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Stefan Majer, ceph-devel@vger.kernel.org

On 30.05.2012 15:27, Mark Nelson wrote:
> Great, thanks.  I'll try to look at the results later this morning.  If
> you want to look at them yourself you can open them with the blkparse
> program (and seekwatcher too, though there is a bug in the src you have
> to fix to make it work right)
I'm not familiar with blkparse or seekwatcher, so I don't know what I
should do with the output...

>>> If you could archive/send me the results, that might help us get an idea
>>> of what is actually getting sent out to the disk.  Your data disk
>>> throughput on 3.0 looks pretty close to what I normally get (including
>>> on 3.4).  I'm guessing the issue you are seeing on 3.4 is probably not
>>> the seek problem I mentioned earlier (unless something is causing so
>>> many seeks that it more or less paralyzes the disk).
>> As i have a SSD i can't believe seeks can be a problem.
> 
> Ah, sorry. I  forgot you were on SSD.  Honestly I'm surpised that with
> 3.0 you weren't getting better performance.  Something to look into once
> we figure out why your 3.4 performance is so bad!
Yes, I think this is a separate problem.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
       [not found]                     ` <CADdPHGvxCmuViy+0==Vkdz_QjC1K+kD5kD1m7+0tYM2YDTtJbw@mail.gmail.com>
@ 2012-05-30 13:54                       ` Stefan Priebe - Profihost AG
       [not found]                       ` <4FC63381.6090300@inktank.com>
  1 sibling, 0 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-30 13:54 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

On 30.05.2012 15:38, Stefan Majer wrote:
> There is a small howto from Linus:
> http://kerneltrap.org/node/11753
> 
> You basically need to be able to compile the kernel from source. Start
> in the freshly checked-out source tree:
> git bisect start
> git bisect good v3.0
> git bisect bad v3.2
>  
> Git will then pick a version in between; you compile it, deploy it
> to your machine and check whether it is good or bad.
> Then you tell git whether it was good or bad, and git will again choose a
> version between the two. In the end you will get a single commit, or a
> handful of commits, that is probably the cause of the problem.
Thanks, I will try that after Mark has looked into the blktrace ;-)

Thanks,
Stefan
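
The procedure described above is a binary search over the commit history. As a toy illustration of why it converges quickly, the sketch below runs the same search over a made-up list of commits. The commit names, the position of the first bad commit, and the `is_bad` check are all hypothetical stand-ins for "build the kernel, boot it, run rados bench".

```python
from typing import Callable, Sequence, Tuple


def find_first_bad(commits: Sequence[str],
                   is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Binary-search an oldest-to-newest commit list for the first bad commit.

    Precondition: commits[0] is known good, commits[-1] is known bad, and
    every commit after the first bad one is also bad.
    Returns the first bad commit and how many commits had to be tested.
    """
    good, bad = 0, len(commits) - 1
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return commits[bad], tested


# Hypothetical history: 1000 commits, regression introduced at commit 637.
history = [f"commit-{i}" for i in range(1000)]
first_bad, builds = find_first_bad(
    history, lambda commit: int(commit.split("-")[1]) >= 637
)
print(first_bad, "found after", builds, "test builds")
```

Each round halves the remaining range, so the number of kernel builds grows only with the logarithm of the number of commits between the good and bad versions.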

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-30 12:41               ` Stefan Priebe - Profihost AG
       [not found]                 ` <CADdPHGsmr8Ht1pTWH1Oe8=NmAyM81SSdH+c_GV89D8ntfyUmgA@mail.gmail.com>
  2012-05-30 13:27                 ` Mark Nelson
@ 2012-05-30 14:16                 ` Mark Nelson
  2012-05-30 18:42                   ` Stefan Priebe
  2 siblings, 1 reply; 73+ messages in thread
From: Mark Nelson @ 2012-05-30 14:16 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Stefan Majer, ceph-devel@vger.kernel.org

On 5/30/12 7:41 AM, Stefan Priebe - Profihost AG wrote:
> Hi Mark,
>
> didn't had the time to answer your mails - but i will get on this one first.
>
>> Would you mind installing blktrace and running "blktrace -o test-3.4 -d
>> /dev/sdb" on the OSD node during a short (say 60s) test on 3.4?
> sure no problem.
>
> here it is:
> http://www.mediafire.com/?6cw87btn7mzco25
>
> Output:
> === sdb ===
>    CPU  0:                18075 events,      848 KiB data
>    CPU  1:                10738 events,      504 KiB data
>    CPU  2:                 8639 events,      405 KiB data
>    CPU  3:                 8614 events,      404 KiB data
>    CPU  4:                    0 events,        0 KiB data
>    CPU  5:                    0 events,        0 KiB data
>    CPU  6:                  143 events,        7 KiB data
>    CPU  7:                    0 events,        0 KiB data
>    Total:                 46209 events (dropped 0),     2167 KiB data
>
>> If you could archive/send me the results, that might help us get an idea
>> of what is actually getting sent out to the disk.  Your data disk
>> throughput on 3.0 looks pretty close to what I normally get (including
>> on 3.4).  I'm guessing the issue you are seeing on 3.4 is probably not
>> the seek problem I mentioned earlier (unless something is causing so
>> many seeks that it more or less paralyzes the disk).
> As i have a SSD i can't believe seeks can be a problem.
>
> Stefan
Ok, I put up a seekwatcher movie showing the writes going to your SSD:

http://nhm.ceph.com/movies/mailinglist-tests/stefan.mpg

Some quick observations:

In your blktrace results there are some really big gaps after cfq 
schedule dispatch:

>   8,16   0        0    11.386025866     0  m   N cfq schedule dispatch
>   8,16   2      975    12.393446988  3074  A  WS 176147976 + 8 <- (8,17) 176145928
>   8,16   0        0    12.762164080     0  m   N cfq schedule dispatch
>   8,16   0     2193    13.355165118  3312  A WSM 175875008 + 227 <- (8,17) 175872960

Specifically, the gap in the movie where there is no write activity 
around second 30 correlates in the blktrace results with one of these 
stalls:
>   8,16   0        0    29.548567957     0  m   N cfq schedule dispatch
>   8,16   2     2185    34.548923918  2688  A   W 2192 + 8 <- (8,17) 144

As to why this is happening, I don't know yet.  I'll have more later.

Mark
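
Stalls like these can also be found mechanically. The sketch below scans blkparse text output and reports consecutive events that are more than a threshold apart. It assumes the default blkparse line layout, in which the fourth whitespace-separated field is the timestamp in seconds; the function names and the threshold are illustrative choices, not part of any tool.

```python
from typing import Iterable, List, Optional, Tuple


def event_time(line: str) -> Optional[float]:
    """Return the timestamp of a blkparse event line, or None for other lines."""
    fields = line.split()
    if len(fields) < 4:
        return None
    try:
        return float(fields[3])
    except ValueError:
        return None  # fourth field is not a number, so not an event line


def find_stalls(lines: Iterable[str], threshold: float) -> List[Tuple[float, float]]:
    """List (start, end) timestamps with no event for more than `threshold` seconds."""
    stalls = []
    previous = None
    for line in lines:
        now = event_time(line)
        if now is None:
            continue
        if previous is not None and now - previous > threshold:
            stalls.append((previous, now))
        previous = now
    return stalls


# The two events quoted above around second 30.
sample = """\
  8,16   0        0    29.548567957     0  m   N cfq schedule dispatch
  8,16   2     2185    34.548923918  2688  A   W 2192 + 8 <- (8,17) 144
"""
for start, end in find_stalls(sample.splitlines(), threshold=1.0):
    print(f"no events between {start:.3f}s and {end:.3f}s ({end - start:.3f}s)")
```

On a full trace, the input would be the text produced by running blkparse on the captured files, and the threshold would need tuning to the workload.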

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
       [not found]                       ` <4FC63381.6090300@inktank.com>
@ 2012-05-30 14:53                         ` Stefan Priebe
  2012-05-30 14:56                           ` Mark Nelson
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe @ 2012-05-30 14:53 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Stefan Majer, ceph-devel@vger.kernel.org

On 30.05.2012 16:49, Mark Nelson wrote:
> On 05/30/2012 08:38 AM, Stefan Majer wrote:
>> No, I don't think so either; this was just an example. Maybe it is
>> something totally different.
>
> You could try setting up a pool with a replication level of 1 and see
> how that does. It will be faster in any event, but it would be
> interesting to see how much faster.
Is there an easier way than modifying the CRUSH map?

PS: I also tested the noop scheduler - same result.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-30 14:53                         ` Stefan Priebe
@ 2012-05-30 14:56                           ` Mark Nelson
  2012-05-30 18:26                             ` Stefan Priebe
  0 siblings, 1 reply; 73+ messages in thread
From: Mark Nelson @ 2012-05-30 14:56 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Stefan Majer, ceph-devel@vger.kernel.org

On 05/30/2012 09:53 AM, Stefan Priebe wrote:
> On 30.05.2012 16:49, Mark Nelson wrote:
>> On 05/30/2012 08:38 AM, Stefan Majer wrote:
>>> No, I don't think so either; this was just an example. Maybe it is
>>> something totally different.
>>
>> You could try setting up a pool with a replication level of 1 and see
>> how that does. It will be faster in any event, but it would be
>> interesting to see how much faster.
> Is there an easier way than modifying the CRUSH map?
>
> PS: I also tested the noop scheduler - same result.
>
> Stefan

something like:

ceph osd pool create POOL [pg_num [pgp_num]]

then:

ceph osd pool set POOL size VALUE


Mark

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-30 14:56                           ` Mark Nelson
@ 2012-05-30 18:26                             ` Stefan Priebe
  2012-05-30 19:41                               ` Mark Nelson
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe @ 2012-05-30 18:26 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Stefan Majer, ceph-devel@vger.kernel.org

Hi Mark,

On 30.05.2012 16:56, Mark Nelson wrote:
> On 05/30/2012 09:53 AM, Stefan Priebe wrote:
>> On 30.05.2012 16:49, Mark Nelson wrote:
>>> You could try setting up a pool with a replication level of 1 and see
>>> how that does. It will be faster in any event, but it would be
>>> interesting to see how much faster.
>> Is there an easier way than modifying the CRUSH map?
>
> something like:
> ceph osd pool create POOL [pg_num [pgp_num]]
> then:
> ceph osd pool set POOL size VALUE

With pool size 1 the writes are constant at around 112 MB/s:
http://pastebin.com/raw.php?i=haDPNTfQ

So does it have something to do with the replication?

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-30 14:16                 ` Mark Nelson
@ 2012-05-30 18:42                   ` Stefan Priebe
       [not found]                     ` <CADdPHGuxa7TAyqXcXehb9WgKgkHwkybYTrj2oue_PKsiF+oR3A@mail.gmail.com>
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe @ 2012-05-30 18:42 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Stefan Majer, ceph-devel@vger.kernel.org

Hi Mark,

> Specifically, the gap in the movie where there is no write activity
> around second 30 correlates in the blktrace results with one of these
> stalls:
>> 8,16 0 0 29.548567957 0 m N cfq schedule dispatch
>> 8,16 2 2185 34.548923918 2688 A W 2192 + 8 <- (8,17) 144
>
> As to why this is happening, I don't know yet. I'll have more later.
Should I try the bisect thing?

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
  2012-05-30 18:26                             ` Stefan Priebe
@ 2012-05-30 19:41                               ` Mark Nelson
  0 siblings, 0 replies; 73+ messages in thread
From: Mark Nelson @ 2012-05-30 19:41 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: Stefan Majer, ceph-devel@vger.kernel.org

On 05/30/2012 01:26 PM, Stefan Priebe wrote:
> Hi Mark,
>
> On 30.05.2012 16:56, Mark Nelson wrote:
>> On 05/30/2012 09:53 AM, Stefan Priebe wrote:
>>> On 30.05.2012 16:49, Mark Nelson wrote:
>>>> You could try setting up a pool with a replication level of 1 and see
>>>> how that does. It will be faster in any event, but it would be
>>>> interesting to see how much faster.
>>> is there an easier way than modifying the crush map?
> >
>> something like:
>> ceph osd pool create POOL [pg_num [pgp_num]]
>> then:
>> ceph osd pool set POOL size VALUE
>
> With pool size 1 the writes are constant around 112MB/s:
> http://pastebin.com/raw.php?i=haDPNTfQ
>
> So has it something todo with the replication?
>
> Stefan

Well now that is interesting.  Replication is pretty network heavy.  In 
addition to the client transfers to the OSDs, you have each OSD node 
sending and receiving data from each other.  Based on these results it 
looks like you may be stalling waiting for data to replicate so the 
client stops sending new requests.  If you set the osd, filestore, and 
messenger debugging up to like 20 you'll get a ton of info that may 
provide more clues.

Otherwise, a while ago I started making a list of performance related 
settings and tests that we (Inktank) may want to check for customers.  
Note that this is a work in progress and the values may not be exactly 
right yet.  You could check and see if any of the networking settings 
have changed on your setup between 3.0 and 3.4:

http://ceph.com/wiki/Performance_analysis

Also there was a thread a while back where Jim Schutt saw problems that 
looked like disk performance issues due to tcp autotuning policy:

http://www.spinics.net/lists/ceph-devel/msg05049.html

That seemed to be more an issue with lots of clients and OSDs per node, 
but I thought I'd mention it since some of the effects are similar.

Mark

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4
       [not found]                     ` <CADdPHGuxa7TAyqXcXehb9WgKgkHwkybYTrj2oue_PKsiF+oR3A@mail.gmail.com>
@ 2012-05-30 21:10                       ` Stefan Priebe
       [not found]                         ` <CADdPHGutEwoDc=Kcrqcx2ZMO=dqhuoT5iLoP-WxqD+e5ZUmBRA@mail.gmail.com>
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe @ 2012-05-30 21:10 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

On 30.05.2012 20:47, Stefan Majer wrote:

>  From my perspective marks hints regarding blktrace end up in the same
> summary as the iostat ouput gives.
> You see stalls, not induced by disk by any means, no other obvious hints
> where the lag might come from.
> So if you want to know why kernels > 3.2 are slow for your workload i
> would drill down this with git bisect.

OK here are some tests regarding the kernel version - all made with XFS.

Starting with 3.2.0-rc1 it drops from 164MB/s (bonding) to 119MB/s but 
it never goes down to 0MB/s. 3.2.18 shows the same as 3.2-rc1.

Then with 3.3-rc1 I'm seeing even faster speed (178MB/s) than with 3.0.X, 
so everything is fine again. So it seems 3.2.X had another bug which 
reduced the speed and which was fixed in 3.3.

Beginning with 3.3-rc4 it gets bad, with drops to 0MB/s. So it should be 
a commit between 3.3-rc3 and 3.3-rc4. Sadly these are 370 commits. No 
idea where to start.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread
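
For scale: a binary search over 370 candidate commits needs at most about
nine build-and-test rounds, which is the search that git bisect automates.
The sketch below only illustrates that arithmetic. first_bad() and the
integer commit list are hypothetical stand-ins, and it assumes a linear
history in which every commit after the culprit is also bad (real kernel
history contains merges, so the actual number of steps can differ slightly).

```python
from typing import Callable, List, Tuple


def first_bad(commits: List[int], is_bad: Callable[[int], bool]) -> Tuple[int, int]:
    """Binary-search an ordered commit list for the first bad commit.

    Assumes the last commit is known to be bad and that badness is
    monotonic (good ... good, bad ... bad). Returns the first bad commit
    and the number of commits that had to be built and tested.
    """
    lo, hi = 0, len(commits) - 1  # invariant: first bad commit is in [lo, hi]
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1  # one kernel build plus one benchmark run
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid + 1
    return commits[lo], tests


# 370 placeholder commits; commit 123 is the (made-up) culprit.
culprit, rounds = first_bad(list(range(370)), lambda c: c >= 123)
print(culprit, rounds)
```

Each round corresponds to marking one tested kernel as good or bad, so
370 commits is a far smaller job than it first looks.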

* Re: poor OSD performance using kernel 3.4 => problem found
       [not found]                         ` <CADdPHGutEwoDc=Kcrqcx2ZMO=dqhuoT5iLoP-WxqD+e5ZUmBRA@mail.gmail.com>
@ 2012-05-31  7:10                           ` Stefan Priebe - Profihost AG
  2012-05-31  7:30                             ` Yehuda Sadeh
       [not found]                             ` <CADdPHGv0YjxDQFnZML-55jDj7XxHxaxUZ_FeQ=ReKK6Rs7NNhw@mail.gmail.com>
  0 siblings, 2 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-31  7:10 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Mark Nelson, ceph-devel

Hi Mark, hi Stefan,

first, thanks for all your help and time.

I found the commit that causes this problem, and it is TCP related,
but I'm still wondering whether the behaviour this commit produces is
intended.

The commit in question is:
git show c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
commit c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
Author: Jason Wang <jasowang@redhat.com>
Date:   Thu Feb 2 00:07:00 2012 +0000

    tcp: properly initialize tcp memory limits

    Commit 4acb4190 tries to fix the using uninitialized value
    introduced by commit 3dc43e3,  but it would make the
    per-socket memory limits too small.

    This patch fixes this and also remove the redundant codes
    introduced in 4acb4190.

    Signed-off-by: Jason Wang <jasowang@redhat.com>
    Acked-by: Glauber Costa <glommer@parallels.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 4cb9cd2..7a7724d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -778,7 +778,6 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
 static __net_init int ipv4_sysctl_init_net(struct net *net)
 {
        struct ctl_table *table;
-       unsigned long limit;

        table = ipv4_net_table;
        if (!net_eq(net, &init_net)) {
@@ -815,11 +814,6 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
        net->ipv4.sysctl_rt_cache_rebuild_count = 4;

        tcp_init_mem(net);
-       limit = nr_free_buffer_pages() / 8;
-       limit = max(limit, 128UL);
-       net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
-       net->ipv4.sysctl_tcp_mem[1] = limit;
-       net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;

        net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
                        net_ipv4_ctl_path, table);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a34f5cf..37755cc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3229,7 +3229,6 @@ __setup("thash_entries=", set_thash_entries);

 void tcp_init_mem(struct net *net)
 {
-       /* Set per-socket limits to no more than 1/128 the pressure threshold */
        unsigned long limit = nr_free_buffer_pages() / 8;
        limit = max(limit, 128UL);
        net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
@@ -3298,7 +3297,8 @@ void __init tcp_init(void)
        sysctl_max_syn_backlog = max(128, cnt / 256);

        tcp_init_mem(&init_net);
-       limit = nr_free_buffer_pages() / 8;
+       /* Set per-socket limits to no more than 1/128 the pressure threshold */
+       limit = nr_free_buffer_pages() << (PAGE_SHIFT - 10);
        limit = max(limit, 128UL);
        max_share = min(4UL*1024*1024, limit);

Greets
Stefan

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31  7:10                           ` poor OSD performance using kernel 3.4 => problem found Stefan Priebe - Profihost AG
@ 2012-05-31  7:30                             ` Yehuda Sadeh
       [not found]                               ` <CADdPHGtz9Jq624DMO6Dve2AcJ9vrnFHbyqRa+qheA+0-y4k++g@mail.gmail.com>
  2012-05-31 13:21                               ` Yann Dupont
       [not found]                             ` <CADdPHGv0YjxDQFnZML-55jDj7XxHxaxUZ_FeQ=ReKK6Rs7NNhw@mail.gmail.com>
  1 sibling, 2 replies; 73+ messages in thread
From: Yehuda Sadeh @ 2012-05-31  7:30 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Stefan Majer, Mark Nelson, ceph-devel

On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> Hi Marc, Hi Stefan,
>
> first thanks for all your help and time.
>
> I found the commit which results in this problem and it is TCP related
> but i'm still wondering if the expected behaviour of this commit is
> expected?
>
> The commit in question is:
> git show c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
> commit c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
> Author: Jason Wang <jasowang@redhat.com>
> Date:   Thu Feb 2 00:07:00 2012 +0000
>
>    tcp: properly initialize tcp memory limits
>
>    Commit 4acb4190 tries to fix the using uninitialized value
>    introduced by commit 3dc43e3,  but it would make the
>    per-socket memory limits too small.
>
>    This patch fixes this and also remove the redundant codes
>    introduced in 4acb4190.
>
>    Signed-off-by: Jason Wang <jasowang@redhat.com>
>    Acked-by: Glauber Costa <glommer@parallels.com>
>    Signed-off-by: David S. Miller <davem@davemloft.net>
>
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 4cb9cd2..7a7724d 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -778,7 +778,6 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
>  static __net_init int ipv4_sysctl_init_net(struct net *net)
>  {
>        struct ctl_table *table;
> -       unsigned long limit;
>
>        table = ipv4_net_table;
>        if (!net_eq(net, &init_net)) {
> @@ -815,11 +814,6 @@ static __net_init int ipv4_sysctl_init_net(struct
> net *net)
>        net->ipv4.sysctl_rt_cache_rebuild_count = 4;
>
>        tcp_init_mem(net);
> -       limit = nr_free_buffer_pages() / 8;
> -       limit = max(limit, 128UL);
> -       net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
> -       net->ipv4.sysctl_tcp_mem[1] = limit;
> -       net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
>
>        net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
>                        net_ipv4_ctl_path, table);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index a34f5cf..37755cc 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -3229,7 +3229,6 @@ __setup("thash_entries=", set_thash_entries);
>
>  void tcp_init_mem(struct net *net)
>  {
> -       /* Set per-socket limits to no more than 1/128 the pressure
> threshold */
>        unsigned long limit = nr_free_buffer_pages() / 8;
>        limit = max(limit, 128UL);
>        net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
> @@ -3298,7 +3297,8 @@ void __init tcp_init(void)
>        sysctl_max_syn_backlog = max(128, cnt / 256);
>
>        tcp_init_mem(&init_net);
> -       limit = nr_free_buffer_pages() / 8;
> +       /* Set per-socket limits to no more than 1/128 the pressure
> threshold */
> +       limit = nr_free_buffer_pages() << (PAGE_SHIFT - 10);
>        limit = max(limit, 128UL);
>        max_share = min(4UL*1024*1024, limit);
>
Yeah, this might have affected the TCP performance. Looking at the
current Linus tree, this function looks more like it did beforehand,
so it was probably reverted one way or another.

Yehuda
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4 => problem found
       [not found]                             ` <CADdPHGv0YjxDQFnZML-55jDj7XxHxaxUZ_FeQ=ReKK6Rs7NNhw@mail.gmail.com>
@ 2012-05-31  8:04                               ` Stefan Priebe - Profihost AG
  2012-05-31  8:09                                 ` Stefan Majer
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-31  8:04 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Mark Nelson, yehuda, ceph-devel@vger.kernel.org

On 31.05.2012 09:27, Stefan Majer wrote:
> we have set them in /etc/sysctl.conf to:
> net.ipv4.tcp_mem = 10000000 10000000 10000000

This does not help ;-(

> wow, this was fast !
> if i understand this commit correct it simply skips a in-kernel
> configuration of network related sysctl parameters, especialy
> net.ipv4.tcp_mem

I also tried this one:
net.ipv4.tcp_rmem = 4096 524287 16777216
net.ipv4.tcp_wmem = 4096 524287 16777216
# grabbed values from 3.0.X
net.ipv4.tcp_mem = 1162962      1550617 2325924

Still no help. But if I use 3.4 and revert the commit, it works fine.
While browsing through the source, I wasn't able to find which other
parts are influenced by this limit.

I only found:
net.ipv4.tcp_mem
and
net.ipv4.tcp_rmem
and
net.ipv4.tcp_wmem

Greets
Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31  8:04                               ` Stefan Priebe - Profihost AG
@ 2012-05-31  8:09                                 ` Stefan Majer
  2012-05-31 11:34                                   ` Stefan Priebe - Profihost AG
  2012-05-31 12:18                                   ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 73+ messages in thread
From: Stefan Majer @ 2012-05-31  8:09 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Mark Nelson, yehuda, ceph-devel@vger.kernel.org

Hi Stefan,

then you should probably describe this in a short mail to Jason Wang
and ask him how to circumvent this commit with sysctl settings.
I'm pretty sure my sysctl setting reverts the first part of the
commit. So the second part is probably the evil one?

Greetings
Stefan

On Thu, May 31, 2012 at 10:04 AM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
>
> On 31.05.2012 09:27, Stefan Majer wrote:
> > we have set them in /etc/sysctl.conf to:
> > net.ipv4.tcp_mem = 10000000 10000000 10000000
>
> This does not help ;-(
>
> > wow, this was fast !
> > if i understand this commit correct it simply skips a in-kernel
> > configuration of network related sysctl parameters, especialy
> > net.ipv4.tcp_mem
>
> I also tied this one:
> net.ipv4.tcp_rmem = 4096 524287 16777216
> net.ipv4.tcp_wmem = 4096 524287 16777216
> # grabbed values from 3.0.X
> net.ipv4.tcp_mem = 1162962      1550617 2325924
>
> still - no help -. But if i use 3.4 and revert the commit it works fine.
> But i wasn't able to find which other parts are influenced by this limit
> while browsing through the source.
>
> I only found:
> net.ipv4.tcp_mem
> and
> net.ipv4.tcp_rmem
> and
> net.ipv4.tcp_wmem
>
> Greets
> Stefan




--
Stefan Majer

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31  8:09                                 ` Stefan Majer
@ 2012-05-31 11:34                                   ` Stefan Priebe - Profihost AG
  2012-05-31 12:18                                   ` Stefan Priebe - Profihost AG
  1 sibling, 0 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-31 11:34 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Mark Nelson, yehuda, ceph-devel@vger.kernel.org

On 31.05.2012 10:09, Stefan Majer wrote:
> Hi Stefan,
>
> then you should probably describe this in a short mail to Jason Wang
> and ask him how to circumvent this commit with sysctl settings.

Done, hopefully he can help.

> I´m pretty sure my sysctl setting reverts the first part of the
> commit. So probably the second part is the evil one ?
Yes, it seems so.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31  8:09                                 ` Stefan Majer
  2012-05-31 11:34                                   ` Stefan Priebe - Profihost AG
@ 2012-05-31 12:18                                   ` Stefan Priebe - Profihost AG
  1 sibling, 0 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-31 12:18 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Mark Nelson, yehuda, ceph-devel@vger.kernel.org

Hi Mark, Hi Stefan,

I found a way to solve it by comparing /proc/sys/net between a patched and 
an unpatched kernel.

Strangely, the problem occurs when the values are too big (in the new kernel).

With the smaller values everything works fine, even under 3.4. Any ideas 
how that can be? I thought these values should be tuned to the maximum for 
best performance.

- => old kernel
+ => new kernel

-/proc/sys/net/ipv4/tcp_rmem:4096       87380   6291456
+/proc/sys/net/ipv4/tcp_rmem:4096       87380   514873
-/proc/sys/net/ipv4/tcp_wmem:4096       16384   4194304
+/proc/sys/net/ipv4/tcp_wmem:4096       16384   514873


Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread
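
The two tcp_wmem maxima listed above (4194304 and 514873) can be reproduced
from the limit computation in the diff quoted earlier in this thread. What
follows is a sketch under assumptions, not a statement about the kernel
beyond what that diff shows: it assumes 4 KiB pages (PAGE_SHIFT = 12), a
hypothetical nr_free_buffer_pages() value of 4118984 (picked as 514873 * 8,
roughly 16 GB), and that max_share is what ends up as the third field of
tcp_wmem. The helper names are made up for illustration.

```python
PAGE_SHIFT = 12                   # assumption: 4 KiB pages
MAX_SHARE_CAP = 4 * 1024 * 1024   # the 4UL*1024*1024 cap from the diff


def max_share_without_commit(free_buffer_pages: int) -> int:
    """tcp_init() before c43b874: limit = nr_free_buffer_pages() / 8."""
    limit = max(free_buffer_pages // 8, 128)
    return min(MAX_SHARE_CAP, limit)


def max_share_with_commit(free_buffer_pages: int) -> int:
    """tcp_init() after c43b874: limit = pages << (PAGE_SHIFT - 10)."""
    limit = max(free_buffer_pages << (PAGE_SHIFT - 10), 128)
    return min(MAX_SHARE_CAP, limit)


pages = 514873 * 8  # hypothetical page count, not measured on any node here
print(max_share_without_commit(pages))  # limit stays below the 4 MiB cap
print(max_share_with_commit(pages))     # limit exceeds the cap, so the cap wins
```

Under these assumptions the commit raises the per-socket maximum by roughly
a factor of eight on such a machine. This does not explain the 6291456
value shown for tcp_rmem, which is not derived in the quoted diff.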

* Re: poor OSD performance using kernel 3.4 => problem found
       [not found]                               ` <CADdPHGtz9Jq624DMO6Dve2AcJ9vrnFHbyqRa+qheA+0-y4k++g@mail.gmail.com>
@ 2012-05-31 12:31                                 ` Mark Nelson
  2012-05-31 12:33                                   ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 73+ messages in thread
From: Mark Nelson @ 2012-05-31 12:31 UTC (permalink / raw)
  To: Stefan Majer; +Cc: Yehuda Sadeh, Stefan Priebe - Profihost AG, ceph-devel

Hi Stefan,

Please do share!  I was planning on starting out on the wiki and 
eventually getting these kinds of things into the master docs.  If you 
(and others) have already done testing it would be really interesting to 
compare experiences.  So far I've been just kind of throwing stuff into:

http://ceph.com/wiki/Performance_analysis

In its current form it's pretty inadequate, but I'm hoping to 
eventually get back to it.  A lot of the work I've been doing recently 
is looking at underlying FS write behavior (specifically seeks) and if 
we can get any reasonable improvement through mkfs and mount options.

Mark

On 5/31/12 2:34 AM, Stefan Majer wrote:
> Hi,
>
> if Stefan confirms this as a solution it might be a good idea to 
> collect some performance optimization hints for OSDs at 
> http://ceph.com/docs/master
> probably separated into:
>
> Gigabit Ethernet based deployments
>  with Jumbo Frames
>
>  without Jumbo Frames
> 10 Gigabit Ethernet based deployments
>  with Jumbo Frames
>
>  without Jumbo Frames
>
> I can share some of our configurations as well
>
> Greetings
> Stefan
>
> On Thu, May 31, 2012 at 9:30 AM, Yehuda Sadeh <yehuda@inktank.com> wrote:
>
>     On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG
>     <s.priebe@profihost.ag> wrote:
>     > Hi Marc, Hi Stefan,
>     >
>     > first thanks for all your help and time.
>     >
>     > I found the commit which results in this problem and it is TCP
>     related
>     > but i'm still wondering if the expected behaviour of this commit is
>     > expected?
>     >
>     > The commit in question is:
>     > git show c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
>     > commit c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
>     > Author: Jason Wang <jasowang@redhat.com>
>     > Date:   Thu Feb 2 00:07:00 2012 +0000
>     >
>     >    tcp: properly initialize tcp memory limits
>     >
>     >    Commit 4acb4190 tries to fix the using uninitialized value
>     >    introduced by commit 3dc43e3,  but it would make the
>     >    per-socket memory limits too small.
>     >
>     >    This patch fixes this and also remove the redundant codes
>     >    introduced in 4acb4190.
>     >
>     >    Signed-off-by: Jason Wang <jasowang@redhat.com>
>     >    Acked-by: Glauber Costa <glommer@parallels.com>
>     >    Signed-off-by: David S. Miller <davem@davemloft.net>
>     >
>     > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>     > index 4cb9cd2..7a7724d 100644
>     > --- a/net/ipv4/sysctl_net_ipv4.c
>     > +++ b/net/ipv4/sysctl_net_ipv4.c
>     > @@ -778,7 +778,6 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
>     >  static __net_init int ipv4_sysctl_init_net(struct net *net)
>     >  {
>     >        struct ctl_table *table;
>     > -       unsigned long limit;
>     >
>     >        table = ipv4_net_table;
>     >        if (!net_eq(net, &init_net)) {
>     > @@ -815,11 +814,6 @@ static __net_init int
>     ipv4_sysctl_init_net(struct
>     > net *net)
>     >        net->ipv4.sysctl_rt_cache_rebuild_count = 4;
>     >
>     >        tcp_init_mem(net);
>     > -       limit = nr_free_buffer_pages() / 8;
>     > -       limit = max(limit, 128UL);
>     > -       net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
>     > -       net->ipv4.sysctl_tcp_mem[1] = limit;
>     > -       net->ipv4.sysctl_tcp_mem[2] =
>     net->ipv4.sysctl_tcp_mem[0] * 2;
>     >
>     >        net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
>     >                        net_ipv4_ctl_path, table);
>     > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>     > index a34f5cf..37755cc 100644
>     > --- a/net/ipv4/tcp.c
>     > +++ b/net/ipv4/tcp.c
>     > @@ -3229,7 +3229,6 @@ __setup("thash_entries=", set_thash_entries);
>     >
>     >  void tcp_init_mem(struct net *net)
>     >  {
>     > -       /* Set per-socket limits to no more than 1/128 the pressure
>     > threshold */
>     >        unsigned long limit = nr_free_buffer_pages() / 8;
>     >        limit = max(limit, 128UL);
>     >        net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
>     > @@ -3298,7 +3297,8 @@ void __init tcp_init(void)
>     >        sysctl_max_syn_backlog = max(128, cnt / 256);
>     >
>     >        tcp_init_mem(&init_net);
>     > -       limit = nr_free_buffer_pages() / 8;
>     > +       /* Set per-socket limits to no more than 1/128 the pressure
>     > threshold */
>     > +       limit = nr_free_buffer_pages() << (PAGE_SHIFT - 10);
>     >        limit = max(limit, 128UL);
>     >        max_share = min(4UL*1024*1024, limit);
>     >
>     Yeah, this might have affected the tcp performance. Looking at the
>     current linus tree this function looks more like it looked beforehand,
>     so it was probable reverted this way or another.
>
>     Yehuda
>
>
>
>
> -- 
> Stefan Majer



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31 12:31                                 ` Mark Nelson
@ 2012-05-31 12:33                                   ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-31 12:33 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Stefan Majer, Yehuda Sadeh, ceph-devel

On 31.05.2012 14:31, Mark Nelson wrote:
> Hi Stefan,
>
> Please do share! I was planning on starting out on the wiki and
> eventually getting these kinds of things into the master docs. If you
> (and others) have already done testing it would be really interesting to
> compare experiences. So far I've been just kind of throwing stuff into:
>
> http://ceph.com/wiki/Performance_analysis
>
> In it's current form it's pretty inadequate, but I'm hoping to
> eventually get back to it. A lot of the work I've been doing recently is
> looking at underlying FS write behavior (specifically seeks) and if we
> can get any reasonable improvement through mkfs and mount options.

I'll at least start sharing once I have a system that runs fine ;-) I plan 
to switch to 10GbE next week.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31  7:30                             ` Yehuda Sadeh
       [not found]                               ` <CADdPHGtz9Jq624DMO6Dve2AcJ9vrnFHbyqRa+qheA+0-y4k++g@mail.gmail.com>
@ 2012-05-31 13:21                               ` Yann Dupont
  2012-05-31 13:37                                 ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 73+ messages in thread
From: Yann Dupont @ 2012-05-31 13:21 UTC (permalink / raw)
  To: Yehuda Sadeh
  Cc: Stefan Priebe - Profihost AG, Stefan Majer, Mark Nelson,
	ceph-devel

On 31/05/2012 09:30, Yehuda Sadeh wrote:
> On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG
> <s.priebe@profihost.ag> wrote:
>> Hi Marc, Hi Stefan,
>>

Hello, back today

Today, I upgraded my 2 last osd nodes with big storage, so now all my 
nodes are equivalent.

Using the 3.4.0 kernel, I still get good results with the rbd pool, but 
fluctuating values with the data pool.


>> first thanks for all your help and time.
>>
>> I found the commit which results in this problem and it is TCP related
>> but i'm still wondering if the expected behaviour of this commit is
>> expected?
>

....

>>
> Yeah, this might have affected the tcp performance. Looking at the
> current linus tree this function looks more like it looked beforehand,
> so it was probable reverted this way or another!
>
> Yehuda

Well, I saw you probably found the culprit.

So I tried the latest git kernel (as of this morning).

Now data gives good results:

root@label5:~#  rados -p data bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       215       199   795.765       796  0.073769 0.0745517
     2      16       430       414   827.833       860  0.060165 0.0753952
     3      16       632       616   821.207       808  0.072241 0.0772463
     4      16       838       822   821.883       824  0.129571 0.0768741
     5      16      1039      1023   818.271       804  0.056867  0.077637
     6      16      1254      1238   825.209       860  0.078801 0.0771122
     7      16      1474      1458   833.023       880  0.062886 0.0764071
     8      16      1669      1653   826.389       780   0.09632 0.0767323
     9      16      1877      1861   827.003       832  0.083765 0.0770398
    10      16      2087      2071   828.294       840  0.051437  0.076937
    11      16      2309      2293   833.714       888  0.080584 0.0764829
    12      16      2535      2519   839.563       904  0.078095 0.0759574
    13      16      2762      2746   844.816       908  0.081323 0.0754571
    14      16      2984      2968   847.889       888  0.076973 0.0752921
    15      16      3203      3187   849.754       876  0.069877 0.0750613
    16      16      3437      3421   855.138       936  0.046845 0.0746941
    17      16      3655      3639   856.126       872  0.052258 0.0745157
    18      16      3862      3846   854.559       828  0.061542 0.0746875
    19      16      4085      4069   856.525       892  0.053889 0.0745582
min lat: 0.033007 max lat: 0.462951 avg lat: 0.0743988
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    20      15      4308      4293   858.492       896  0.054176 0.0743988
Total time run:        20.103415
Total writes made:     4309
Write size:            4194304
Bandwidth (MB/sec):    857.367

Average Latency:       0.0746302
Max latency:           0.462951
Min latency:           0.033007



But very strangely it's now rbd that isn't stable ?!

root@label5:~#  rados -p rbd bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16       155       139    555.87       556  0.046232  0.109021
     2      16       250       234   467.923       380  0.046793 0.0985316
     3      16       250       234   311.955         0         - 0.0985316
     4      16       250       234   233.965         0         - 0.0985316
     5      16       250       234   187.173         0         - 0.0985316
     6      16       266       250   166.645        16  0.038083  0.175697
     7      16       266       250   142.839         0         -  0.175697
     8      16       441       425   212.475       350   0.05512  0.298391
     9      16       476       460   204.422       140   0.04372  0.280483
    10      16       531       515   205.976       220  0.125076  0.309449
    11      16       734       718    261.06       812  0.127582  0.244134
    12      16       795       779   259.637       244  0.065158  0.234156
    13      16       818       802   246.742        92  0.054514  0.241704
    14      16       830       814   232.546        48  0.044386  0.239006
    15      16       837       821   218.909        28   3.41523  0.267521
    16      16      1043      1027   256.721       824   0.04898  0.248212
    17      16      1147      1131   266.088       416  0.048591  0.232725
    18      16      1147      1131   251.305         0         -  0.232725
    19      16      1202      1186   249.657       110  0.081777   0.25501
min lat: 0.033773 max lat: 5.92059 avg lat: 0.245711
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    20      16      1296      1280    255.97       376  0.053797  0.245711
    21       9      1297      1288   245.305        32  0.708133  0.248248
    22       9      1297      1288   234.155         0         -  0.248248
    23       9      1297      1288   223.975         0         -  0.248248
    24       9      1297      1288   214.643         0         -  0.248248
    25       9      1297      1288   206.057         0         -  0.248248
    26       9      1297      1288   198.131         0         -  0.248248
Total time run:        26.829870
Total writes made:     1297
Write size:            4194304
Bandwidth (MB/sec):    193.367

Average Latency:       0.295922
Max latency:           7.36701
Min latency:           0.033773


Strange. I'm wondering if this has something to do with the cache (that is, 
operations I could have run on the nodes beforehand, as all my nodes were just 
freshly rebooted).

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr



^ permalink raw reply	[flat|nested] 73+ messages in thread
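
To compare runs like the two above without eyeballing the tables, the
seconds in which nothing was written can be pulled out of the rados bench
output. This is a minimal sketch that assumes the eight-column per-second
layout shown in this thread (sec, cur ops, started, finished, avg MB/s,
cur MB/s, last lat, avg lat); stalled_seconds() is a hypothetical helper,
not part of any Ceph tool.

```python
from typing import List


def stalled_seconds(bench_output: str) -> List[int]:
    """Return the seconds whose 'cur MB/s' column is 0.

    Header, 'min lat:' and summary lines are skipped because they either
    do not start with a number or do not have exactly eight columns.
    Second 0 is skipped too, since nothing can have finished yet.
    """
    stalls = []
    for line in bench_output.splitlines():
        fields = line.split()
        if len(fields) != 8 or not fields[0].isdigit():
            continue
        second = int(fields[0])
        cur_mb_per_s = float(fields[5])
        if second > 0 and cur_mb_per_s == 0:
            stalls.append(second)
    return stalls
```

Feeding it the saved output of a run (for example via
stalled_seconds(open("bench.txt").read()), where bench.txt is a placeholder
file name) gives the list of stalled seconds. For the rbd run above that
would be seconds 3 to 5, 7, 18 and 22 to 26, while the data run has none.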

* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31 13:21                               ` Yann Dupont
@ 2012-05-31 13:37                                 ` Stefan Priebe - Profihost AG
  2012-05-31 13:45                                   ` Yann Dupont
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-31 13:37 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Yehuda Sadeh, Stefan Majer, Mark Nelson, ceph-devel

On 31.05.2012 15:21, Yann Dupont wrote:
> On 31/05/2012 09:30, Yehuda Sadeh wrote:
>> On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG
>> <s.priebe@profihost.ag> wrote:
> But very strangely it's now rbd that isn't stable ?!
>
> root@label5:~# rados -p rbd bench 20 write -t 16
> Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>      0       0         0         0         0         0         -         0
>      1      16       155       139    555.87       556  0.046232  0.109021
>      2      16       250       234   467.923       380  0.046793 0.0985316
>      3      16       250       234   311.955         0         - 0.0985316
>      4      16       250       234   233.965         0         - 0.0985316
>      5      16       250       234   187.173         0         - 0.0985316
>      6      16       266       250   166.645        16  0.038083  0.175697
>      7      16       266       250   142.839         0         -  0.175697
>      8      16       441       425   212.475       350   0.05512  0.298391
>      9      16       476       460   204.422       140   0.04372  0.280483
>     10      16       531       515   205.976       220  0.125076  0.309449
>     11      16       734       718    261.06       812  0.127582  0.244134
>     12      16       795       779   259.637       244  0.065158  0.234156
>     13      16       818       802   246.742        92  0.054514  0.241704
>     14      16       830       814   232.546        48  0.044386  0.239006
>     15      16       837       821   218.909        28   3.41523  0.267521
>     16      16      1043      1027   256.721       824   0.04898  0.248212
>     17      16      1147      1131   266.088       416  0.048591  0.232725
>     18      16      1147      1131   251.305         0         -  0.232725
>     19      16      1202      1186   249.657       110  0.081777   0.25501
> min lat: 0.033773 max lat: 5.92059 avg lat: 0.245711
>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>     20      16      1296      1280    255.97       376  0.053797  0.245711
>     21       9      1297      1288   245.305        32  0.708133  0.248248
>     22       9      1297      1288   234.155         0         -  0.248248
>     23       9      1297      1288   223.975         0         -  0.248248
>     24       9      1297      1288   214.643         0         -  0.248248
>     25       9      1297      1288   206.057         0         -  0.248248
>     26       9      1297      1288   198.131         0         -  0.248248
> Total time run:        26.829870
> Total writes made:     1297
> Write size:            4194304
> Bandwidth (MB/sec):    193.367
>
> Average Latency:       0.295922
> Max latency:           7.36701
> Min latency:           0.033773
>
>
> Strange. I'm wondering if this has something to do with cache (that is,
> operation I could have done before on nodes, as all my nodes are just
> freshly rebooted).

Please test setting these values on all OSDs and Clients:
sysctl -w net.ipv4.tcp_rmem="4096        87380   514873"
sysctl -w net.ipv4.tcp_wmem="4096        16384   514873"
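
For anyone trying this, a small sketch of how the current limits could be
recorded first, so they can be compared with the values above and restored
afterwards. The three fields of tcp_rmem/tcp_wmem are the minimum, default
and maximum socket buffer size in bytes. The helper names (parse_tcp_mem,
read_current) are made up for this example; only the /proc/sys paths and the
suggested values are real.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class TcpMem(NamedTuple):
    """min / default / max socket buffer size in bytes."""
    minimum: int
    default: int
    maximum: int


def parse_tcp_mem(text: str) -> TcpMem:
    """Parse a tcp_rmem/tcp_wmem triple such as '4096 87380 514873'."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {text!r}")
    return TcpMem(*(int(f) for f in fields))


def read_current(name: str) -> Optional[TcpMem]:
    """Read the live value from /proc/sys, or None if it is not available."""
    path = Path("/proc/sys/net/ipv4") / name
    try:
        return parse_tcp_mem(path.read_text())
    except (OSError, ValueError):
        return None


# Values suggested in this mail.
SUGGESTED = {
    "tcp_rmem": parse_tcp_mem("4096        87380   514873"),
    "tcp_wmem": parse_tcp_mem("4096        16384   514873"),
}

if __name__ == "__main__":
    for name, wanted in SUGGESTED.items():
        current = read_current(name)
        print(f"{name}: current={current} suggested={wanted}")
```

Keeping the printed "current" line around makes it easy to go back to the
old settings with sysctl -w if the change makes no difference.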

Stefan


* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31 13:37                                 ` Stefan Priebe - Profihost AG
@ 2012-05-31 13:45                                   ` Yann Dupont
  2012-05-31 14:42                                     ` Yann Dupont
  0 siblings, 1 reply; 73+ messages in thread
From: Yann Dupont @ 2012-05-31 13:45 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Yehuda Sadeh, Stefan Majer, Mark Nelson, ceph-devel

On 31/05/2012 15:37, Stefan Priebe - Profihost AG wrote:
> On 31.05.2012 15:21, Yann Dupont wrote:
>> On 31/05/2012 09:30, Yehuda Sadeh wrote:
>>> On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG
>>> <s.priebe@profihost.ag> wrote:
>> But very strangely it's now rbd that isn't stable ?!
>>
>> root@label5:~# rados -p rbd bench 20 write -t 16
>> Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>     0       0         0         0         0         0         -         0
>>     1      16       155       139    555.87       556  0.046232  0.109021
>>     2      16       250       234   467.923       380  0.046793 0.0985316
>>     3      16       250       234   311.955         0         - 0.0985316
>>     4      16       250       234   233.965         0         - 0.0985316
>>     5      16       250       234   187.173         0         - 0.0985316
>>     6      16       266       250   166.645        16  0.038083  0.175697
>>     7      16       266       250   142.839         0         -  0.175697
>>     8      16       441       425   212.475       350   0.05512  0.298391
>>     9      16       476       460   204.422       140   0.04372  0.280483
>>    10      16       531       515   205.976       220  0.125076  0.309449
>>    11      16       734       718    261.06       812  0.127582  0.244134
>>    12      16       795       779   259.637       244  0.065158  0.234156
>>    13      16       818       802   246.742        92  0.054514  0.241704
>>    14      16       830       814   232.546        48  0.044386  0.239006
>>    15      16       837       821   218.909        28   3.41523  0.267521
>>    16      16      1043      1027   256.721       824   0.04898  0.248212
>>    17      16      1147      1131   266.088       416  0.048591  0.232725
>>    18      16      1147      1131   251.305         0         -  0.232725
>>    19      16      1202      1186   249.657       110  0.081777   0.25501
>> min lat: 0.033773 max lat: 5.92059 avg lat: 0.245711
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>    20      16      1296      1280    255.97       376  0.053797  0.245711
>>    21       9      1297      1288   245.305        32  0.708133  0.248248
>>    22       9      1297      1288   234.155         0         -  0.248248
>>    23       9      1297      1288   223.975         0         -  0.248248
>>    24       9      1297      1288   214.643         0         -  0.248248
>>    25       9      1297      1288   206.057         0         -  0.248248
>>    26       9      1297      1288   198.131         0         -  0.248248
>> Total time run: 26.829870
>> Total writes made: 1297
>> Write size: 4194304
>> Bandwidth (MB/sec): 193.367
>>
>> Average Latency: 0.295922
>> Max latency: 7.36701
>> Min latency: 0.033773
>>
>>
>> Strange. I'm wondering if this has something to do with cache (that is,
>> operation I could have done before on nodes, as all my nodes are just
>> freshly rebooted).
>
> Please test setting these values on all OSDs and Clients:
> sysctl -w net.ipv4.tcp_rmem="4096        87380   514873"
> sysctl -w net.ipv4.tcp_wmem="4096        16384   514873"
>
> Stefan

Same result: pool data is stable (845 MB/s average), while rbd is jumping
(229 MB/s average, with a max latency of 6 s).
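
To put a number on "jumping", one option is to count the seconds in which
rados bench reports 0 MB/s while writes are still in flight. A rough sketch,
assuming the 8-column per-second format of the output quoted above; the
function name is invented, and the sample rows are just the first seconds of
the rbd run.

```python
from typing import List


def stalled_seconds(bench_output: str) -> List[int]:
    """Return the 'sec' values where cur MB/s was 0 while writes were in flight."""
    stalled = []
    for line in bench_output.splitlines():
        fields = line.split()
        # Per-second rows have 8 columns and start with the second counter;
        # header and "min lat:" lines do not match and are skipped.
        if len(fields) != 8 or not fields[0].isdigit():
            continue
        sec, cur_ops, cur_mbs = int(fields[0]), int(fields[1]), float(fields[5])
        if cur_ops > 0 and cur_mbs == 0:
            stalled.append(sec)
    return stalled


SAMPLE = """\
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 155 139 555.87 556 0.046232 0.109021
2 16 250 234 467.923 380 0.046793 0.0985316
3 16 250 234 311.955 0 - 0.0985316
4 16 250 234 233.965 0 - 0.0985316
5 16 250 234 187.173 0 - 0.0985316
6 16 266 250 166.645 16 0.038083 0.175697
7 16 266 250 142.839 0 - 0.175697
"""

print(stalled_seconds(SAMPLE))
```

Running the same function over the data and rbd runs would give two lists
that are easier to compare than the averages alone.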

I'm running the latest Linus git kernel
(commit af56e0aa35f3ae2a4c1a6d1000702df1dd78cb76), and I am relying on the
fact that the patch has been reverted in it.

I can try plain 3.4.0 with the 'culprit patch' manually reverted.

What puzzles me is that this morning, with 3.4.0, it was rbd that was
stable, and now I have the exact opposite.

I'll start by rebooting into the old 3.4.0 kernel to see if things are
reproducible.

Cheers,
-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr



* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31 13:45                                   ` Yann Dupont
@ 2012-05-31 14:42                                     ` Yann Dupont
  2012-05-31 15:32                                       ` Mark Nelson
  0 siblings, 1 reply; 73+ messages in thread
From: Yann Dupont @ 2012-05-31 14:42 UTC (permalink / raw)
  To: Yann Dupont
  Cc: Stefan Priebe - Profihost AG, Yehuda Sadeh, Stefan Majer,
	Mark Nelson, ceph-devel

On 31/05/2012 15:45, Yann Dupont wrote:
> On 31/05/2012 15:37, Stefan Priebe - Profihost AG wrote:

> What puzzles me is that this morning, with 3.4.0, it was rbd that was
> stable, and now I have the exact opposite.
>
> I'll start by rebooting into the old 3.4.0 kernel to see if things are
> reproducible.
>
> Cheers,

I'd say my problem is probably not related. Freshly rebooting all OSD
nodes with the 3.4.0 kernel (the same kernel I used this morning) now gives
pool data stable & rbd unstable. That matches what I see with current git,
and is the exact opposite of the results I had on Tuesday & this morning.

Go figure.

Could it have to do with previous usage of the OSDs? Or the active mds? Or the mon?

As I already said, my OSDs use btrfs with the big metadata feature,
so going back to a 3.0 kernel requires a complete reformat of my OSDs first.

But I will do it if you think it can shed some light on this case.

Cheers,
-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31 14:42                                     ` Yann Dupont
@ 2012-05-31 15:32                                       ` Mark Nelson
  2012-05-31 15:43                                         ` Yann Dupont
  0 siblings, 1 reply; 73+ messages in thread
From: Mark Nelson @ 2012-05-31 15:32 UTC (permalink / raw)
  To: Yann Dupont
  Cc: Stefan Priebe - Profihost AG, Yehuda Sadeh, Stefan Majer,
	ceph-devel

On 05/31/2012 09:42 AM, Yann Dupont wrote:
> On 31/05/2012 15:45, Yann Dupont wrote:
>> On 31/05/2012 15:37, Stefan Priebe - Profihost AG wrote:
>
>> What puzzles me is that this morning, with 3.4.0, it was rbd that was
>> stable, and now I have the exact opposite.
>>
>> I'll start by rebooting into the old 3.4.0 kernel to see if things are
>> reproducible.
>>
>> Cheers,
>
>
> I'd say my problem is probably not related. Freshly rebooting all OSD
> nodes with the 3.4.0 kernel (the same kernel I used this morning) now
> gives pool data stable & rbd unstable. That matches what I see with
> current git, and is the exact opposite of the results I had on Tuesday
> & this morning.
>
> Go figure.
>
> Could it have to do with previous usage of the OSDs? Or the active mds? Or the mon?
>
> As I already said, my OSDs use btrfs with the big metadata feature,
> so going back to a 3.0 kernel requires a complete reformat of my OSDs first.
>
> But I will do it if you think it can shed some light on this case.
>
> Cheers,
Hi Yann,

Can you take a look at how many PGs are in each pool?

ceph osd pool get <pool> pg_num


Thanks,
Mark


* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31 15:32                                       ` Mark Nelson
@ 2012-05-31 15:43                                         ` Yann Dupont
  2012-05-31 16:14                                           ` Mark Nelson
  2012-05-31 16:29                                           ` Sage Weil
  0 siblings, 2 replies; 73+ messages in thread
From: Yann Dupont @ 2012-05-31 15:43 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Stefan Priebe - Profihost AG, Yehuda Sadeh, Stefan Majer,
	ceph-devel

On 31/05/2012 17:32, Mark Nelson wrote:
> ceph osd pool get <pool> pg_num

My setup is detailed in a previous mail, but as I changed some
parameters this morning, here it is again:

root@chichibu:~# ceph osd pool get data pg_num
PG_NUM: 576
root@chichibu:~# ceph osd pool get rbd pg_num
PG_NUM: 576



The pg num is quite low because I started with small OSDs (9 OSDs with
200G each, on internal disks) when I formatted. I have since reduced to 8 OSDs
(osd.4 is out), but with much larger (& faster) storage.

Each of the 8 OSDs now has 5T, and for the moment I try to keep the
OSDs similar. Replication is set to 2.
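
As a rough check of "quite low", the figures above translate into PG copies
per OSD as follows. This is only arithmetic on the numbers in this mail
(576 PGs per pool, replication 2, 8 OSDs in) and assumes an even spread; the
function name is invented for the example.

```python
def pg_copies_per_osd(pg_num: int, replicas: int, osds: int) -> float:
    """Average number of PG copies each OSD holds for one pool."""
    return pg_num * replicas / osds


# Figures from this mail: pg_num 576, replication 2, 8 OSDs in.
per_pool = pg_copies_per_osd(pg_num=576, replicas=2, osds=8)
print(f"{per_pool:.0f} PG copies per OSD for each pool")
```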


The fs is btrfs, formatted with big metadata (-l 64k -n64k) and mounted
with space_cache,compress=lzo,nobarrier,noatime.

The journal is on tmpfs:
  osd journal = /dev/shm/journal
  osd journal size = 6144

I know this is dangerous; remember, it's NOT a production system for the
moment.

No OSD is full; I don't have much data stored for the moment.

Concerning the crush map, I'm not using the default one:

The 8 nodes are in 3 different locations (some kilometers apart): 2 are
in one place, 2 in another, and the last 4 in the main location.

There is 10G between all the nodes and they are in the same VLAN, with no
router involved (but there is (negligible?) latency between nodes).

I tried to group hosts together to avoid problems when I lose a location
(an electrical problem, for example). I'm not sure I customized the
crush map as I should have.

Here is the map:
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 rack
type 3 pool

# buckets
host karuizawa {
     id -5        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.2 weight 1.000
}
host hazelburn {
     id -6        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.3 weight 1.000
}
rack loire {
     id -3        # do not change unnecessarily
     # weight 2.000
     alg straw
     hash 0    # rjenkins1
     item karuizawa weight 1.000
     item hazelburn weight 1.000
}
host carsebridge {
     id -8        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.5 weight 1.000
}
host cameronbridge {
     id -9        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.6 weight 1.000
}
rack chantrerie {
     id -7        # do not change unnecessarily
     # weight 2.000
     alg straw
     hash 0    # rjenkins1
     item carsebridge weight 1.000
     item cameronbridge weight 1.000
}
host chichibu {
     id -2        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.0 weight 1.000
}
host glenesk {
     id -4        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.1 weight 1.000
}
host braeval {
     id -10        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.7 weight 1.000
}
host hanyu {
     id -11        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.8 weight 1.000
}
rack lombarderie {
     id -12        # do not change unnecessarily
     # weight 4.000
     alg straw
     hash 0    # rjenkins1
     item chichibu weight 1.000
     item glenesk weight 1.000
     item braeval weight 1.000
     item hanyu weight 1.000
}
pool default {
     id -1        # do not change unnecessarily
     # weight 8.000
     alg straw
     hash 0    # rjenkins1
     item loire weight 2.000
     item chantrerie weight 2.000
     item lombarderie weight 4.000
}

# rules
rule data {
     ruleset 0
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type host
     step emit
}
rule metadata {
     ruleset 1
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type host
     step emit
}
rule rbd {
     ruleset 2
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type host
     step emit
}

# end crush map
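
One thing that can be checked from the map alone: as far as the rules read,
"chooseleaf firstn 0 type host" only asks for distinct hosts, so with
replication 2 both copies may end up in the same rack. The sketch below
simply enumerates host pairs from the map above and counts those sharing a
rack. It does not run CRUSH and says nothing about how likely each pair is;
the dictionary and function names are made up for the example.

```python
from itertools import combinations

# Rack -> hosts, copied from the crush map above.
RACKS = {
    "loire": ["karuizawa", "hazelburn"],
    "chantrerie": ["carsebridge", "cameronbridge"],
    "lombarderie": ["chichibu", "glenesk", "braeval", "hanyu"],
}


def same_rack_pairs(racks):
    """Host pairs that are distinct hosts but sit in the same rack."""
    rack_of = {host: rack for rack, hosts in racks.items() for host in hosts}
    return [
        (a, b)
        for a, b in combinations(sorted(rack_of), 2)
        if rack_of[a] == rack_of[b]
    ]


all_hosts = [h for hosts in RACKS.values() for h in hosts]
total_pairs = len(list(combinations(all_hosts, 2)))
shared = same_rack_pairs(RACKS)
print(f"{len(shared)} of {total_pairs} host pairs share a rack")
```

If the two copies should always sit in different locations, a rule using
"step chooseleaf firstn 0 type rack" would probably be the thing to try,
ideally on a copy of the map first.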

Hope it helps,
cheers


-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr



* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31 15:43                                         ` Yann Dupont
@ 2012-05-31 16:14                                           ` Mark Nelson
  2012-05-31 16:29                                           ` Sage Weil
  1 sibling, 0 replies; 73+ messages in thread
From: Mark Nelson @ 2012-05-31 16:14 UTC (permalink / raw)
  To: Yann Dupont
  Cc: Stefan Priebe - Profihost AG, Yehuda Sadeh, Stefan Majer,
	ceph-devel

On 05/31/2012 10:43 AM, Yann Dupont wrote:
> On 31/05/2012 17:32, Mark Nelson wrote:
>> ceph osd pool get <pool> pg_num
>
> My setup is detailed in a previous mail, but as I changed some
> parameters this morning, here it is again:
>
> root@chichibu:~# ceph osd pool get data pg_num
> PG_NUM: 576
> root@chichibu:~# ceph osd pool get rbd pg_num
> PG_NUM: 576
>
>
>
> The pg num is quite low because I started with small OSDs (9 OSDs with
> 200G each, on internal disks) when I formatted. I have since reduced to
> 8 OSDs (osd.4 is out), but with much larger (& faster) storage.
>
> Each of the 8 OSDs now has 5T, and for the moment I try to keep the
> OSDs similar. Replication is set to 2.
>
>
> The fs is btrfs, formatted with big metadata (-l 64k -n64k) and mounted
> with space_cache,compress=lzo,nobarrier,noatime.
>
> The journal is on tmpfs:
>   osd journal = /dev/shm/journal
>   osd journal size = 6144
>
> I know this is dangerous; remember, it's NOT a production system for the
> moment.
>
> No OSD is full; I don't have much data stored for the moment.
>
> Concerning the crush map, I'm not using the default one:
>
> The 8 nodes are in 3 different locations (some kilometers apart): 2 are
> in one place, 2 in another, and the last 4 in the main location.
>
> There is 10G between all the nodes and they are in the same VLAN, with no
> router involved (but there is (negligible?) latency between nodes).
>
> I tried to group hosts together to avoid problems when I lose a location
> (an electrical problem, for example). I'm not sure I customized the
> crush map as I should have.
>
> Here is the map:
> # begin crush map
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 device4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 pool
>
> # buckets
> host karuizawa {
>     id -5        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.2 weight 1.000
> }
> host hazelburn {
>     id -6        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.3 weight 1.000
> }
> rack loire {
>     id -3        # do not change unnecessarily
>     # weight 2.000
>     alg straw
>     hash 0    # rjenkins1
>     item karuizawa weight 1.000
>     item hazelburn weight 1.000
> }
> host carsebridge {
>     id -8        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.5 weight 1.000
> }
> host cameronbridge {
>     id -9        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.6 weight 1.000
> }
> rack chantrerie {
>     id -7        # do not change unnecessarily
>     # weight 2.000
>     alg straw
>     hash 0    # rjenkins1
>     item carsebridge weight 1.000
>     item cameronbridge weight 1.000
> }
> host chichibu {
>     id -2        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.0 weight 1.000
> }
> host glenesk {
>     id -4        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.1 weight 1.000
> }
> host braeval {
>     id -10        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.7 weight 1.000
> }
> host hanyu {
>     id -11        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.8 weight 1.000
> }
> rack lombarderie {
>     id -12        # do not change unnecessarily
>     # weight 4.000
>     alg straw
>     hash 0    # rjenkins1
>     item chichibu weight 1.000
>     item glenesk weight 1.000
>     item braeval weight 1.000
>     item hanyu weight 1.000
> }
> pool default {
>     id -1        # do not change unnecessarily
>     # weight 8.000
>     alg straw
>     hash 0    # rjenkins1
>     item loire weight 2.000
>     item chantrerie weight 2.000
>     item lombarderie weight 4.000
> }
>
> # rules
> rule data {
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type host
>     step emit
> }
> rule metadata {
>     ruleset 1
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type host
>     step emit
> }
> rule rbd {
>     ruleset 2
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type host
>     step emit
> }
>
> # end crush map
>
> Hope it helps,
> cheers
>
>

Hi Yann,

You might want to start out by running sar/iostat/collectl on the OSD 
nodes and seeing if anything looks funny during the slow test compared 
to the fast one.  If that doesn't reveal much, you could run blktrace on 
one of the OSDs during the tests and see if the IO to the disk looks 
different.  I can help out if you want to send me your blktrace results.
Similarly, you could watch the network streams for both tests and see
if anything looks different there.

Thanks!
Mark


* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31 15:43                                         ` Yann Dupont
  2012-05-31 16:14                                           ` Mark Nelson
@ 2012-05-31 16:29                                           ` Sage Weil
  2012-05-31 16:37                                             ` Yann Dupont
  1 sibling, 1 reply; 73+ messages in thread
From: Sage Weil @ 2012-05-31 16:29 UTC (permalink / raw)
  To: Yann Dupont
  Cc: Mark Nelson, Stefan Priebe - Profihost AG, Yehuda Sadeh,
	Stefan Majer, ceph-devel

On Thu, 31 May 2012, Yann Dupont wrote:
> On 31/05/2012 17:32, Mark Nelson wrote:
> > ceph osd pool get <pool> pg_num
> 
> My setup is detailed in a previous mail, but as I changed some parameters
> this morning, here it is again:
> 
> root@chichibu:~# ceph osd pool get data pg_num
> PG_NUM: 576
> root@chichibu:~# ceph osd pool get rbd pg_num
> PG_NUM: 576

Can you post 'ceph osd dump | grep ^pool' so we can see which CRUSH rules 
the pools are mapped to?

Thanks!
sage


> 
> 
> 
> The pg num is quite low because I started with small OSDs (9 OSDs with 200G
> each, on internal disks) when I formatted. I have since reduced to 8 OSDs
> (osd.4 is out), but with much larger (& faster) storage.
> 
> Each of the 8 OSDs now has 5T, and for the moment I try to keep the OSDs
> similar. Replication is set to 2.
> 
> 
> The fs is btrfs, formatted with big metadata (-l 64k -n64k) and mounted with
> space_cache,compress=lzo,nobarrier,noatime.
> 
> The journal is on tmpfs:
>  osd journal = /dev/shm/journal
>  osd journal size = 6144
> 
> I know this is dangerous; remember, it's NOT a production system for the
> moment.
> 
> No OSD is full; I don't have much data stored for the moment.
> 
> Concerning the crush map, I'm not using the default one:
> 
> The 8 nodes are in 3 different locations (some kilometers apart): 2 are in
> one place, 2 in another, and the last 4 in the main location.
> 
> There is 10G between all the nodes and they are in the same VLAN, with no
> router involved (but there is (negligible?) latency between nodes).
> 
> I tried to group hosts together to avoid problems when I lose a location
> (an electrical problem, for example). I'm not sure I customized the crush map
> as I should have.
> 
> Here is the map:
> # begin crush map
> 
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 device4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> 
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 pool
> 
> # buckets
> host karuizawa {
>     id -5        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.2 weight 1.000
> }
> host hazelburn {
>     id -6        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.3 weight 1.000
> }
> rack loire {
>     id -3        # do not change unnecessarily
>     # weight 2.000
>     alg straw
>     hash 0    # rjenkins1
>     item karuizawa weight 1.000
>     item hazelburn weight 1.000
> }
> host carsebridge {
>     id -8        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.5 weight 1.000
> }
> host cameronbridge {
>     id -9        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.6 weight 1.000
> }
> rack chantrerie {
>     id -7        # do not change unnecessarily
>     # weight 2.000
>     alg straw
>     hash 0    # rjenkins1
>     item carsebridge weight 1.000
>     item cameronbridge weight 1.000
> }
> host chichibu {
>     id -2        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.0 weight 1.000
> }
> host glenesk {
>     id -4        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.1 weight 1.000
> }
> host braeval {
>     id -10        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.7 weight 1.000
> }
> host hanyu {
>     id -11        # do not change unnecessarily
>     # weight 1.000
>     alg straw
>     hash 0    # rjenkins1
>     item osd.8 weight 1.000
> }
> rack lombarderie {
>     id -12        # do not change unnecessarily
>     # weight 4.000
>     alg straw
>     hash 0    # rjenkins1
>     item chichibu weight 1.000
>     item glenesk weight 1.000
>     item braeval weight 1.000
>     item hanyu weight 1.000
> }
> pool default {
>     id -1        # do not change unnecessarily
>     # weight 8.000
>     alg straw
>     hash 0    # rjenkins1
>     item loire weight 2.000
>     item chantrerie weight 2.000
>     item lombarderie weight 4.000
> }
> 
> # rules
> rule data {
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type host
>     step emit
> }
> rule metadata {
>     ruleset 1
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type host
>     step emit
> }
> rule rbd {
>     ruleset 2
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type host
>     step emit
> }
> 
> # end crush map
> 
> Hope it helps,
> cheers
> 
> 
> -- 
> Yann Dupont - Service IRTS, DSI Université de Nantes
> Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


* Re: poor OSD performance using kernel 3.4 => problem found
  2012-05-31 16:29                                           ` Sage Weil
@ 2012-05-31 16:37                                             ` Yann Dupont
  0 siblings, 0 replies; 73+ messages in thread
From: Yann Dupont @ 2012-05-31 16:37 UTC (permalink / raw)
  To: Sage Weil
  Cc: Mark Nelson, Stefan Priebe - Profihost AG, Yehuda Sadeh,
	Stefan Majer, ceph-devel

On 31/05/2012 18:29, Sage Weil wrote:

> Can you post 'ceph osd dump | grep ^pool' so we can see which CRUSH rules
> the pools are mapped to?
>

Yes:

root@label5:~# ceph osd dump | grep ^pool
pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 576 
pgp_num 576 last_change 816 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 
576 pgp_num 576 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 576 
pgp_num 576 last_change 1 owner 0
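
For comparing pools side by side, the dump can be reduced to the fields of
interest. A small sketch, assuming one pool per line in the format shown
above (so the wrapped lines are re-joined in the sample); the regular
expression and function name are illustrative only.

```python
import re

POOL_RE = re.compile(
    r"^pool (?P<id>\d+) '(?P<name>[^']+)' rep size (?P<size>\d+) "
    r"crush_ruleset (?P<ruleset>\d+) .*?pg_num (?P<pg_num>\d+)"
)


def parse_pools(dump: str) -> dict:
    """Map pool name -> (rep size, crush_ruleset, pg_num)."""
    pools = {}
    for line in dump.splitlines():
        m = POOL_RE.match(line)
        if m:
            pools[m["name"]] = (int(m["size"]), int(m["ruleset"]), int(m["pg_num"]))
    return pools


SAMPLE = """\
pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 576 pgp_num 576 last_change 816 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 576 pgp_num 576 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 576 pgp_num 576 last_change 1 owner 0
"""

for name, (size, ruleset, pg_num) in parse_pools(SAMPLE).items():
    print(f"{name}: size={size} ruleset={ruleset} pg_num={pg_num}")
```

On this dump the three pools differ only in their crush_ruleset number, and
the three rules in the map have identical steps.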

cheers,


-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr



end of thread, other threads:[~2012-05-31 16:37 UTC | newest]

Thread overview: 73+ messages
2012-05-24 14:10 poor OSD performance using kernel 3.4 Stefan Priebe - Profihost AG
2012-05-24 14:57 ` Mark Nelson
     [not found] ` <CAJCPpW+SKnnVUaDEAsCkKyZwMVrHCRJF2C8zqB4eORgwW5p=1Q@mail.gmail.com>
     [not found]   ` <4FBE7ABC.5020502@profihost.ag>
2012-05-24 18:53     ` Mark Nelson
2012-05-24 19:05       ` Stefan Priebe
2012-05-25  1:53         ` Mark Nelson
2012-05-25  8:19           ` Stefan Priebe - Profihost AG
2012-05-25 11:31             ` Stefan Priebe - Profihost AG
2012-05-25 12:10               ` Stefan Priebe - Profihost AG
2012-05-25 15:47                 ` Alexandre DERUMIER
2012-05-27  9:11                   ` Stefan Priebe - Profihost AG
2012-05-27 11:33                     ` Alexandre DERUMIER
2012-05-27 18:57                       ` Stefan Priebe
2012-05-28  5:37                         ` Alexandre DERUMIER
2012-05-28  6:25                           ` Stefan Priebe
2012-05-28  6:52                             ` Alexandre DERUMIER
2012-05-28 19:48                               ` Stefan Priebe
2012-05-29  3:54                                 ` Alexandre DERUMIER
2012-05-29  8:22                                   ` Stefan Priebe - Profihost AG
2012-05-29 13:01                                     ` Alexandre DERUMIER
2012-05-29 14:18                                       ` Stefan Priebe - Profihost AG
2012-05-29  9:46                                   ` Stefan Priebe - Profihost AG
2012-05-29 13:39                                     ` Yann Dupont
2012-05-29 14:43                                       ` Stefan Priebe - Profihost AG
2012-05-29 17:50                                         ` Mark Nelson
2012-05-29 19:50                                           ` Yann Dupont
2012-05-29 21:04                                           ` Stefan Priebe
2012-05-29 21:08                                           ` Stefan Priebe
2012-05-29 21:31                                             ` Yann Dupont
2012-05-29 21:34                                               ` Stefan Priebe
2012-05-29 21:45                                                 ` Yann Dupont
2012-05-30  6:29                                                   ` Stefan Priebe - Profihost AG
2012-05-29 21:41                                             ` Mark Nelson
2012-05-30  6:22                                               ` Stefan Priebe - Profihost AG
2012-05-30  7:20                                                 ` building test cluster : missing /etc/ceph/client.admin.keyring, need help Alexandre DERUMIER
2012-05-30  7:25                                                   ` Stefan Priebe - Profihost AG
2012-05-30  7:33                                                     ` Alexandre DERUMIER
2012-05-30  7:47                                                       ` Alexandre DERUMIER
2012-05-29 22:25 ` poor OSD performance using kernel 3.4 Mark Nelson
2012-05-30  6:33   ` Stefan Priebe - Profihost AG
     [not found]     ` <CADdPHGs9dpSh9Oyu+5yDhyYU=Et_-zF5MuYybBuuAN5DgR433A@mail.gmail.com>
2012-05-30  7:16       ` Stefan Priebe - Profihost AG
     [not found]         ` <CADdPHGuiJqZUCK-0qR_CrOo6GRhkjaCdkOhJ2boq3zD0_voTsA@mail.gmail.com>
2012-05-30 11:04           ` Stefan Priebe - Profihost AG
     [not found]             ` <CADdPHGuLAL5+hkzq0tigqu355DvPxkhE5sxBhOVZPj=EzDSVtA@mail.gmail.com>
2012-05-30 11:25               ` Stefan Priebe - Profihost AG
2012-05-30 12:17             ` Mark Nelson
2012-05-30 12:41               ` Stefan Priebe - Profihost AG
     [not found]                 ` <CADdPHGsmr8Ht1pTWH1Oe8=NmAyM81SSdH+c_GV89D8ntfyUmgA@mail.gmail.com>
2012-05-30 13:19                   ` Stefan Priebe - Profihost AG
     [not found]                     ` <CADdPHGvxCmuViy+0==Vkdz_QjC1K+kD5kD1m7+0tYM2YDTtJbw@mail.gmail.com>
2012-05-30 13:54                       ` Stefan Priebe - Profihost AG
     [not found]                       ` <4FC63381.6090300@inktank.com>
2012-05-30 14:53                         ` Stefan Priebe
2012-05-30 14:56                           ` Mark Nelson
2012-05-30 18:26                             ` Stefan Priebe
2012-05-30 19:41                               ` Mark Nelson
2012-05-30 13:27                 ` Mark Nelson
2012-05-30 13:51                   ` Stefan Priebe - Profihost AG
2012-05-30 14:16                 ` Mark Nelson
2012-05-30 18:42                   ` Stefan Priebe
     [not found]                     ` <CADdPHGuxa7TAyqXcXehb9WgKgkHwkybYTrj2oue_PKsiF+oR3A@mail.gmail.com>
2012-05-30 21:10                       ` Stefan Priebe
     [not found]                         ` <CADdPHGutEwoDc=Kcrqcx2ZMO=dqhuoT5iLoP-WxqD+e5ZUmBRA@mail.gmail.com>
2012-05-31  7:10                           ` poor OSD performance using kernel 3.4 => problem found Stefan Priebe - Profihost AG
2012-05-31  7:30                             ` Yehuda Sadeh
     [not found]                               ` <CADdPHGtz9Jq624DMO6Dve2AcJ9vrnFHbyqRa+qheA+0-y4k++g@mail.gmail.com>
2012-05-31 12:31                                 ` Mark Nelson
2012-05-31 12:33                                   ` Stefan Priebe - Profihost AG
2012-05-31 13:21                               ` Yann Dupont
2012-05-31 13:37                                 ` Stefan Priebe - Profihost AG
2012-05-31 13:45                                   ` Yann Dupont
2012-05-31 14:42                                     ` Yann Dupont
2012-05-31 15:32                                       ` Mark Nelson
2012-05-31 15:43                                         ` Yann Dupont
2012-05-31 16:14                                           ` Mark Nelson
2012-05-31 16:29                                           ` Sage Weil
2012-05-31 16:37                                             ` Yann Dupont
     [not found]                             ` <CADdPHGv0YjxDQFnZML-55jDj7XxHxaxUZ_FeQ=ReKK6Rs7NNhw@mail.gmail.com>
2012-05-31  8:04                               ` Stefan Priebe - Profihost AG
2012-05-31  8:09                                 ` Stefan Majer
2012-05-31 11:34                                   ` Stefan Priebe - Profihost AG
2012-05-31 12:18                                   ` Stefan Priebe - Profihost AG
2012-05-30 11:51     ` poor OSD performance using kernel 3.4 Mark Nelson
