reads while 100% write

* reads while 100% write
@ 2016-03-30  2:35 Evgeniy Firsov
  2016-03-30 11:19 ` Sage Weil
  0 siblings, 1 reply; 16+ messages in thread
From: Evgeniy Firsov @ 2016-03-30  2:35 UTC
  To: ceph-devel@vger.kernel.org

After pulling the master branch on Friday I started seeing odd fio
behavior: I see a lot of reads while writing, and very low performance
regardless of whether the workload is read or write.

Output from sequential 1M write:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd               0.00   409.00    0.00  364.00     0.00  3092.00    16.99     0.28    0.78    0.00    0.78   0.76  27.60
sde               0.00   242.00  365.00  363.00  2436.00  9680.00    33.29     0.18    0.24    0.42    0.07   0.23  16.80



block.db -> /dev/sdd
block -> /dev/sde

health HEALTH_OK
monmap e1: 1 mons at {a=127.0.0.1:6789/0}
       election epoch 3, quorum 0 a
osdmap e7: 1 osds: 1 up, 1 in
       flags sortbitwise
pgmap v24: 64 pgs, 1 pools, 577 MB data, 9152 objects
       8210 MB used, 178 GB / 186 GB avail
             64 active+clean
client io 1550 kB/s rd, 9559 kB/s wr, 645 op/s rd, 387 op/s wr


On the earlier revision (c1e41af), everything looks as expected:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd               0.00  4910.00    0.00  680.00     0.00  22416.00    65.93     1.05    1.55    0.00    1.55   1.18  80.00
sde               0.00     0.00    0.00 3418.00     0.00 217612.00   127.33    63.78   18.18    0.00   18.18   0.25  86.40

Another observation, which may be related to the issue, is that CPU load
is imbalanced: a single "tp_osd_tp" thread is 100% busy while the rest are
idle. It looks like all load goes to a single thread pool shard; earlier,
CPU load was well balanced.
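
For reference, the gap between the two runs can be put into numbers using
the iostat figures above. This is a minimal Python sketch: the sde values
are copied by hand from the two samples, and the dictionary layout and the
function name are illustrative only, not part of any tool.

```python
# sde (the "block" device) figures copied from the two iostat samples above.
current = {"r/s": 365.00, "w/s": 363.00, "rkB/s": 2436.00, "wkB/s": 9680.00}
earlier = {"r/s": 0.00, "w/s": 3418.00, "rkB/s": 0.00, "wkB/s": 217612.00}


def reads_per_write(sample):
    """Read requests issued per write request in one iostat sample."""
    return sample["r/s"] / sample["w/s"]


# How much lower the write throughput on sde is in the current run.
slowdown = earlier["wkB/s"] / current["wkB/s"]

print(f"current master: {reads_per_write(current):.2f} reads per write")
print(f"c1e41af:        {reads_per_write(earlier):.2f} reads per write")
print(f"write throughput on sde is {slowdown:.1f}x lower on current master")
```

On these two samples that is roughly one read per write on current master,
none on c1e41af, and write throughput on sde lower by a factor of about 22.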


--
Evgeniy



PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: reads while 100% write
  2016-03-30  2:35 reads while 100% write Evgeniy Firsov
@ 2016-03-30 11:19 ` Sage Weil
  2016-03-30 18:14   ` Evgeniy Firsov
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2016-03-30 11:19 UTC
  To: Evgeniy Firsov; +Cc: ceph-devel@vger.kernel.org

On Wed, 30 Mar 2016, Evgeniy Firsov wrote:
> After pulling master branch on Friday I start seeing odd fio behavior, I
> see a lot of reads while writing and very low performance no matter
> whether it read or write workload.
> 
> Output from sequential 1M write:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz
> avgqu-sz   await r_await w_await  svctm  %util
> 
> sdd               0.00   409.00    0.00  364.00     0.00  3092.00    16.99
>     0.28    0.78    0.00    0.78   0.76  27.60
> sde               0.00   242.00  365.00  363.00  2436.00  9680.00    33.29
>     0.18    0.24    0.42    0.07   0.23  16.80
> 
> 
> 
> block.db -> /dev/sdd
> block -> /dev/sde
> 
> health HEALTH_OK
> monmap e1: 1 mons at {a=127.0.0.1:6789/0}
>        election epoch 3, quorum 0 a
> osdmap e7: 1 osds: 1 up, 1 in
>        flags sortbitwise
> pgmap v24: 64 pgs, 1 pools, 577 MB data, 9152 objects
>        8210 MB used, 178 GB / 186 GB avail
>              64 active+clean
> client io 1550 kB/s rd, 9559 kB/s wr, 645 op/s rd, 387 op/s wr
> 
> 
> While on earlier revision(c1e41af) everything looks as expected:
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz
> avgqu-sz   await r_await w_await  svctm  %util
> sdd               0.00  4910.00    0.00  680.00     0.00 22416.00    65.93
>     1.05    1.55    0.00    1.55   1.18  80.00
> sde               0.00     0.00    0.00 3418.00     0.00 217612.00
> 127.33    63.78   18.18    0.00   18.18   0.25  86.40
> 
> Other observation, may be related to the issue, is that CPU load is
> imbalanced. Single "tp_osd_tp" thread is 100% busy, while the rest is idle.
> Looks like all load goes to single thread pool shard, earlier CPU was well
> balanced.

Hmm.  Can you capture a log with debug bluestore = 20 and debug bdev = 20?

Thanks!
sage



* Re: reads while 100% write
  2016-03-30 11:19 ` Sage Weil
@ 2016-03-30 18:14   ` Evgeniy Firsov
  2016-03-30 18:37     ` Jason Dillaman
  0 siblings, 1 reply; 16+ messages in thread
From: Evgeniy Firsov @ 2016-03-30 18:14 UTC
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

These are suspicious lines:

2016-03-30 10:54:23.142205 7f2e933ff700 10 bluestore(src/dev/osd0) read 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 6144018~6012 = 6012
2016-03-30 10:54:23.142252 7f2e933ff700 15 bluestore(src/dev/osd0) read 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
2016-03-30 10:54:23.142260 7f2e933ff700 20 bluestore(src/dev/osd0) _do_read 8210~4096 size 6150030
2016-03-30 10:54:23.142267 7f2e933ff700  5 bdev(src/dev/osd0/block) read 8003854336~8192
2016-03-30 10:54:23.142609 7f2e933ff700 10 bluestore(src/dev/osd0) read 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096 = 4096
2016-03-30 10:54:23.142882 7f2e933ff700 15 bluestore(src/dev/osd0) _write 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
2016-03-30 10:54:23.142888 7f2e933ff700 20 bluestore(src/dev/osd0) _do_write #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096 - have 6150030 bytes in 1 extents

More logs here: http://pastebin.com/74WLzFYw
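
One way to read the bdev line above: the 4096-byte request at offset 8210
is not block-aligned, so it straddles two device blocks. The following is
a minimal Python sketch of that arithmetic. The 4096-byte block size is an
assumption (the log does not state it) and aligned_span is a hypothetical
helper name.

```python
BLOCK_SIZE = 4096  # assumed device block size; not stated in the log


def aligned_span(offset, length, block=BLOCK_SIZE):
    """Return (start, length) of the smallest block-aligned range
    that covers the logical range offset~length."""
    start = (offset // block) * block
    end = -(-(offset + length) // block) * block  # round up to a block boundary
    return start, end - start


# Logical range taken from the "read ... 8210~4096" line in the log above.
start, span = aligned_span(8210, 4096)
print(f"8210~4096 is covered by the aligned range {start}~{span}")
```

Under that assumption the covering range is 8192 bytes long, which matches
the length in the "bdev(src/dev/osd0/block) read 8003854336~8192" line.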




* Re: reads while 100% write
  2016-03-30 18:14   ` Evgeniy Firsov
@ 2016-03-30 18:37     ` Jason Dillaman
  2016-03-30 19:00       ` Evgeniy Firsov
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Dillaman @ 2016-03-30 18:37 UTC
  To: Evgeniy Firsov; +Cc: Sage Weil, ceph-devel

How large is your RBD image?  100 terabytes? 

-- 

Jason Dillaman 



* Re: reads while 100% write
  2016-03-30 18:37     ` Jason Dillaman
@ 2016-03-30 19:00       ` Evgeniy Firsov
  2016-03-30 19:10         ` Jason Dillaman
  0 siblings, 1 reply; 16+ messages in thread
From: Evgeniy Firsov @ 2016-03-30 19:00 UTC
  To: Jason Dillaman; +Cc: Sage Weil, ceph-devel@vger.kernel.org

1.5T in that run.
With 150G the behavior is the same, except that it says
"_do_read 0~18 size 615030" instead of 6M.

Also, when the random 4k write starts, there are more reads than writes:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd               0.00  1887.00    0.00  344.00     0.00  8924.00    51.88     0.36    1.06    0.00    1.06   0.91  31.20
sde              30.00     0.00   30.00  957.00 18120.00  3828.00    44.47     0.25    0.26    3.87    0.14   0.17  16.40

Logs: http://pastebin.com/gGzfR5ez
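
The average request sizes behind the sde row above can be worked out from
the rate and throughput columns. This is a minimal Python sketch with the
values copied by hand from that row; the variable names are illustrative.

```python
# sde figures copied from the random 4k write iostat sample above.
reads_per_s, writes_per_s = 30.00, 957.00
read_kb_per_s, write_kb_per_s = 18120.00, 3828.00

avg_read_kb = read_kb_per_s / reads_per_s     # kB per read request
avg_write_kb = write_kb_per_s / writes_per_s  # kB per write request

print(f"average read size:  {avg_read_kb:.0f} kB")
print(f"average write size: {avg_write_kb:.0f} kB")
```

Writes average 4 kB, as expected for this workload, while reads average
about 604 kB. That is in the same range as the 615030-byte object size
reported by _do_read for the 150G image, although the message does not say
which image this sample was taken from, so any connection is only a guess.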



* Re: reads while 100% write
  2016-03-30 19:00       ` Evgeniy Firsov
@ 2016-03-30 19:10         ` Jason Dillaman
  2016-03-30 19:55           ` Sage Weil
  2016-03-30 20:39           ` Evgeniy Firsov
  0 siblings, 2 replies; 16+ messages in thread
From: Jason Dillaman @ 2016-03-30 19:10 UTC
  To: Evgeniy Firsov; +Cc: Sage Weil, ceph-devel

Are you using the RBD default of 4MB object sizes or are you using something much smaller like 64KB?  An object map of that size should be tracking up to 24,576,000 objects.  When you ran your test before, did you have the RBD object map disabled?  This definitely seems to be a use case where the lack of a cache in front of BlueStore is hurting small IO.
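
The object count quoted above can be checked with a short calculation.
This is a minimal Python sketch under two assumptions that are not stated
in the thread: the object map stores 2 bits of state per object, and the
"1.5T" and "150G" images are 1500 GiB and 150 GiB. The function name is
hypothetical, and header and footer overhead is not modelled.

```python
BITS_PER_OBJECT = 2  # assumed object map encoding: 2 bits of state per object
GIB = 1024 ** 3
KIB = 1024


def object_map_payload(image_bytes, object_bytes):
    """Return (object_count, payload_bytes) for an image,
    ignoring any header/footer overhead in the object map."""
    objects = -(-image_bytes // object_bytes)      # ceiling division
    payload = -(-objects * BITS_PER_OBJECT // 8)   # bits to bytes, rounded up
    return objects, payload


for label, image, obj in [
    ("1500 GiB image, 64 KiB objects", 1500 * GIB, 64 * KIB),
    ("150 GiB image, 64 KiB objects", 150 * GIB, 64 * KIB),
    ("1500 GiB image, 4 MiB objects", 1500 * GIB, 4096 * KIB),
]:
    objects, payload = object_map_payload(image, obj)
    print(f"{label}: {objects:,} objects, {payload:,} bytes of object map")
```

With 64 KiB objects this gives 24,576,000 objects and 6,144,000 bytes for
1500 GiB, and 614,400 bytes for 150 GiB, both slightly under the 6150030
and 615030 byte sizes in the logs. With the default 4 MiB objects the same
1500 GiB image would need only 384,000 objects and 96,000 bytes.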

-- 

Jason Dillaman 



* Re: reads while 100% write
  2016-03-30 19:10         ` Jason Dillaman
@ 2016-03-30 19:55           ` Sage Weil
  2016-03-30 19:59             ` Jason Dillaman
  2016-03-30 20:39           ` Evgeniy Firsov
  1 sibling, 1 reply; 16+ messages in thread
From: Sage Weil @ 2016-03-30 19:55 UTC
  To: Jason Dillaman; +Cc: Evgeniy Firsov, ceph-devel

On Wed, 30 Mar 2016, Jason Dillaman wrote:
> Are you using the RBD default of 4MB object sizes or are you using 
> something much smaller like 64KB?  An object map of that size should be 
> tracking up to 24,576,000 objects.  When you ran your test before, did 
> you have the RBD object map disabled?  This definitely seems to be a use 
> case where the lack of a cache in front of BlueStore is hurting small 
> IO.

Using the rados cache hint WILLNEED is probably appropriate here.

sage
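
For reference, the 24,576,000 figure quoted above is consistent with the object size seen in the logs (6150030 bytes). The sketch below is only a rough sanity check. It assumes the object map stores 2 bits of state per object, that the "1.5T" image is 1500 GiB, and that objects are 64 KiB (the size asked about above); none of these is stated outright in this thread, and the function names are made up for illustration.

```python
# Sanity check of the object-count estimate.
# Assumptions (not stated in the thread): 2 bits of state per object,
# "1.5T" meaning 1500 GiB, and 64 KiB objects.

GIB = 1 << 30
KIB = 1 << 10


def objects_in_image(image_bytes, object_bytes):
    """Number of objects needed to cover the image (ceiling division)."""
    return -(-image_bytes // object_bytes)


def object_map_payload_bytes(num_objects, bits_per_object=2):
    """Bytes needed to hold the per-object state bits (ceiling division)."""
    return -(-(num_objects * bits_per_object) // 8)


num_objects = objects_in_image(1500 * GIB, 64 * KIB)
payload = object_map_payload_bytes(num_objects)
print(num_objects, payload)
```

Under these assumptions the payload comes to 6,144,000 bytes, which leaves about 6 KB of the 6150030-byte object unaccounted for; that remainder would presumably be header and footer overhead.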



* Re: reads while 100% write
  2016-03-30 19:55           ` Sage Weil
@ 2016-03-30 19:59             ` Jason Dillaman
  0 siblings, 0 replies; 16+ messages in thread
From: Jason Dillaman @ 2016-03-30 19:59 UTC (permalink / raw)
  To: Sage Weil; +Cc: Evgeniy Firsov, ceph-devel

This IO is being performed within an OSD class method.  I can add a new cls_cxx_read2 method to accept cache hints and update the associated object map methods.  Would this apply to writes as well?

-- 

Jason Dillaman 




* Re: reads while 100% write
  2016-03-30 19:10         ` Jason Dillaman
  2016-03-30 19:55           ` Sage Weil
@ 2016-03-30 20:39           ` Evgeniy Firsov
  2016-03-30 20:47             ` Jason Dillaman
  1 sibling, 1 reply; 16+ messages in thread
From: Evgeniy Firsov @ 2016-03-30 20:39 UTC (permalink / raw)
  To: Jason Dillaman; +Cc: Sage Weil, ceph-devel@vger.kernel.org

I use 64K.
Explicit settings are identical for both revisions.

It looks like the following change makes performance about 10 times slower:

-OPTION(rbd_default_features, OPT_INT, 3) // only applies to format 2 images
-                                        // +1 for layering, +2 for stripingv2,
-                                        // +4 for exclusive lock, +8 for object map
+OPTION(rbd_default_features, OPT_INT, 61)   // only applies to format 2 images
+                                           // +1 for layering, +2 for stripingv2,
+                                           // +4 for exclusive lock, +8 for object map
+                                           // +16 for fast-diff, +32 for deep-flatten,
+                                           // +64 for journaling
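
To make the two default values easier to compare, here is a small sketch that decodes them using the bit values from the comment in the diff above. The table and the function name are illustrative only and are not taken from the Ceph source.

```python
# Decode an rbd_default_features value using the bit values listed in the
# comment of the diff. The names are informal labels, not API constants.
FEATURE_BITS = {
    1: "layering",
    2: "stripingv2",
    4: "exclusive-lock",
    8: "object-map",
    16: "fast-diff",
    32: "deep-flatten",
    64: "journaling",
}


def decode_features(value):
    """Return the names of all feature bits set in value, lowest bit first."""
    return [name for bit, name in FEATURE_BITS.items() if value & bit]


old_default = decode_features(3)
new_default = decode_features(61)
print(old_default)
print(new_default)
```

Read this way, the new default of 61 turns on exclusive-lock, object-map, fast-diff and deep-flatten on top of layering, and it no longer sets the stripingv2 bit that the old default of 3 had.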






* Re: reads while 100% write
  2016-03-30 20:39           ` Evgeniy Firsov
@ 2016-03-30 20:47             ` Jason Dillaman
  2016-03-30 21:49               ` Evgeniy Firsov
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Dillaman @ 2016-03-30 20:47 UTC (permalink / raw)
  To: Evgeniy Firsov; +Cc: Sage Weil, ceph-devel

Correct, the change for the default RBD features actually merged on March 1 as well (a7470c8), albeit a few hours after the commit you last tested against (c1e41af).  You can revert to pre-Jewel RBD features on an existing image by running the following:

# rbd feature disable <image name> exclusive-lock,object-map,fast-diff,deep-flatten

Hopefully the new PR to add the WILLNEED fadvise flag helps. 
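
As a rough illustration of what the command above does to the feature mask, the sketch below clears the four named bits from the new default of 61. The bit values come from the rbd_default_features comment quoted earlier in the thread. The helper is hypothetical and models only the mask arithmetic, not anything the rbd tool does on the cluster.

```python
# Model "feature disable" as clearing bits in the feature mask.
# Bit values are taken from the rbd_default_features comment in this thread;
# the helper below is hypothetical and not part of any Ceph API.
EXCLUSIVE_LOCK = 4
OBJECT_MAP = 8
FAST_DIFF = 16
DEEP_FLATTEN = 32


def disable(features, *bits):
    """Return features with each of the given bits cleared."""
    for bit in bits:
        features &= ~bit
    return features


remaining = disable(61, EXCLUSIVE_LOCK, OBJECT_MAP, FAST_DIFF, DEEP_FLATTEN)
print(remaining)
```

Under this model the image is left with a mask of 1 (layering only), so the object map is no longer among the enabled features.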

-- 

Jason Dillaman 


----- Original Message -----
> From: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>
> To: "Jason Dillaman" <dillaman@redhat.com>
> Cc: "Sage Weil" <sage@newdream.net>, ceph-devel@vger.kernel.org
> Sent: Wednesday, March 30, 2016 4:39:09 PM
> Subject: Re: reads while 100% write
> 
> I use 64K.
> Explicit settings are identical for both revisions.
> 
> Looks like the following change slows down performance 10 times:
> 
> -OPTION(rbd_default_features, OPT_INT, 3) // only applies to format 2
> images
> -                                        // +1 for layering, +2 for
> stripingv2,
> -                                        // +4 for exclusive lock, +8 for
> object map
> +OPTION(rbd_default_features, OPT_INT, 61)   // only applies to format 2
> images
> +                                           // +1 for layering, +2 for
> stripingv2,
> +                                           // +4 for exclusive lock, +8
> for object map
> +                                           // +16 for fast-diff, +32 for
> deep-flatten,
> +                                           // +64 for journaling
> 
> 
> 
> On 3/30/16, 12:10 PM, "Jason Dillaman" <dillaman@redhat.com> wrote:
> 
> >Are you using the RBD default of 4MB object sizes or are you using
> >something much smaller like 64KB?  An object map of that size should be
> >tracking up to 24,576,000 objects.  When you ran your test before, did
> >you have the RBD object map disabled?  This definitely seems to be a use
> >case where the lack of a cache in front of BlueStore is hurting small IO.
> >
> >--
> >
> >Jason Dillaman
> >
> >
> >----- Original Message -----
> >> From: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>
> >> To: "Jason Dillaman" <dillaman@redhat.com>
> >> Cc: "Sage Weil" <sage@newdream.net>, ceph-devel@vger.kernel.org
> >> Sent: Wednesday, March 30, 2016 3:00:47 PM
> >> Subject: Re: reads while 100% write
> >>
> >> 1.5T in that run.
> >> With 150G behavior is the same. Except it says "_do_read 0~18 size
> >>615030”
> >> instead of 6M.
> >>
> >> Also when random 4k write starts there are more reads then writes:
> >>
> >> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
> >>avgrq-sz
> >> avgqu-sz   await r_await w_await  svctm  %util
> >>
> >> sdd               0.00  1887.00    0.00  344.00     0.00  8924.00
> >>51.88
> >>     0.36    1.06    0.00    1.06   0.91  31.20
> >> sde              30.00     0.00   30.00  957.00 18120.00  3828.00
> >>44.47
> >>     0.25    0.26    3.87    0.14   0.17  16.40
> >>
> >> Logs: http://pastebin.com/gGzfR5ez
> >>
> >>
> >> On 3/30/16, 11:37 AM, "Jason Dillaman" <dillaman@redhat.com> wrote:
> >>
> >> >How large is your RBD image?  100 terabytes?
> >> >
> >> >--
> >> >
> >> >Jason Dillaman
> >> >
> >> >
> >> >----- Original Message -----
> >> >> From: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>
> >> >> To: "Sage Weil" <sage@newdream.net>
> >> >> Cc: ceph-devel@vger.kernel.org
> >> >> Sent: Wednesday, March 30, 2016 2:14:12 PM
> >> >> Subject: Re: reads while 100% write
> >> >>
> >> >> These are suspicious lines:
> >> >>
> >> >> 2016-03-30 10:54:23.142205 7f2e933ff700 10 bluestore(src/dev/osd0) read 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 6144018~6012 = 6012
> >> >> 2016-03-30 10:54:23.142252 7f2e933ff700 15 bluestore(src/dev/osd0) read 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
> >> >> 2016-03-30 10:54:23.142260 7f2e933ff700 20 bluestore(src/dev/osd0) _do_read 8210~4096 size 6150030
> >> >> 2016-03-30 10:54:23.142267 7f2e933ff700  5 bdev(src/dev/osd0/block) read 8003854336~8192
> >> >> 2016-03-30 10:54:23.142609 7f2e933ff700 10 bluestore(src/dev/osd0) read 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096 = 4096
> >> >> 2016-03-30 10:54:23.142882 7f2e933ff700 15 bluestore(src/dev/osd0) _write 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
> >> >> 2016-03-30 10:54:23.142888 7f2e933ff700 20 bluestore(src/dev/osd0) _do_write #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096 - have 6150030 bytes in 1 extents
> >> >>
> >> >> More logs here: http://pastebin.com/74WLzFYw
> >> >>
> >> >>
> >> >>
> >> >> On 3/30/16, 4:19 AM, "Sage Weil" <sage@newdream.net> wrote:
> >> >>
> >> >> >On Wed, 30 Mar 2016, Evgeniy Firsov wrote:
> >> >> >> After pulling master branch on Friday I start seeing odd fio behavior, I
> >> >> >> see a lot of reads while writing and very low performance no matter
> >> >> >> whether it read or write workload.
> >> >> >>
> >> >> >> Output from sequential 1M write:
> >> >> >> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> >> >> >> sdd               0.00   409.00    0.00  364.00     0.00  3092.00    16.99     0.28    0.78    0.00    0.78   0.76  27.60
> >> >> >> sde               0.00   242.00  365.00  363.00  2436.00  9680.00    33.29     0.18    0.24    0.42    0.07   0.23  16.80
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> block.db -> /dev/sdd
> >> >> >> block -> /dev/sde
> >> >> >>
> >> >> >> health HEALTH_OK
> >> >> >> monmap e1: 1 mons at {a=127.0.0.1:6789/0}
> >> >> >>        election epoch 3, quorum 0 a
> >> >> >> osdmap e7: 1 osds: 1 up, 1 in
> >> >> >>        flags sortbitwise
> >> >> >> pgmap v24: 64 pgs, 1 pools, 577 MB data, 9152 objects
> >> >> >>        8210 MB used, 178 GB / 186 GB avail
> >> >> >>              64 active+clean
> >> >> >> client io 1550 kB/s rd, 9559 kB/s wr, 645 op/s rd, 387 op/s wr
> >> >> >>
> >> >> >>
> >> >> >> While on earlier revision(c1e41af) everything looks as expected:
> >> >> >>
> >> >> >> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> >> >> >> sdd               0.00  4910.00    0.00  680.00     0.00 22416.00    65.93     1.05    1.55    0.00    1.55   1.18  80.00
> >> >> >> sde               0.00     0.00    0.00 3418.00     0.00 217612.00   127.33    63.78   18.18    0.00   18.18   0.25  86.40
> >> >> >>
> >> >> >> Other observation, may be related to the issue, is that CPU load is
> >> >> >> imbalanced. Single "tp_osd_tp" thread is 100% busy, while the rest is idle.
> >> >> >> Looks like all load goes to single thread pool shard, earlier CPU was well
> >> >> >> balanced.
> >> >> >
> >> >> >Hmm.  Can you capture a log with debug bluestore = 20 and debug bdev = 20?
> >> >> >
> >> >> >Thanks!
> >> >> >sage
> >> >> >
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Evgeniy
> >> >> >>
> >> >> >>
> >> >> >>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: reads while 100% write
  2016-03-30 20:47             ` Jason Dillaman
@ 2016-03-30 21:49               ` Evgeniy Firsov
  2016-03-30 21:59                 ` Josh Durgin
  0 siblings, 1 reply; 16+ messages in thread
From: Evgeniy Firsov @ 2016-03-30 21:49 UTC (permalink / raw)
  To: Jason Dillaman; +Cc: Sage Weil, ceph-devel@vger.kernel.org

OK, I will use rbd default features = 1 for now.
Thank you for the help.
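
For reference, the rbd_default_features value is a sum of the bit values listed in the config comment quoted in this thread (+1 layering, +2 stripingv2, +4 exclusive lock, +8 object map, +16 fast-diff, +32 deep-flatten, +64 journaling). A minimal Python sketch of the decoding, assuming only that table; the helper name decode_features is made up for illustration and is not part of any Ceph tool:

```python
# Bit values copied from the rbd_default_features comment quoted in this thread.
FEATURE_BITS = {
    1: "layering",
    2: "stripingv2",
    4: "exclusive-lock",
    8: "object-map",
    16: "fast-diff",
    32: "deep-flatten",
    64: "journaling",
}


def decode_features(mask):
    """Return the feature names whose bits are set in mask (hypothetical helper)."""
    return [name for bit, name in sorted(FEATURE_BITS.items()) if mask & bit]


# The three values that come up in this thread: the new setting, the old
# default, and the new default.
for value in (1, 3, 61):
    print(value, decode_features(value))
```

Read this way, 61 adds exclusive-lock, object-map, fast-diff and deep-flatten on top of layering, which is the same list Jason passes to rbd feature disable, while 1 leaves layering only.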

On 3/30/16, 1:47 PM, "Jason Dillaman" <dillaman@redhat.com> wrote:

>Correct, the change for the default RBD features actually merged on March
>1 as well (a7470c8), albeit a few hours after the commit you last tested
>against (c1e41af).  You can revert to pre-Jewel RBD features on an
>existing image by running the following:
>
># rbd feature disable <image name> exclusive-lock,object-map,fast-diff,deep-flatten
>
>Hopefully the new PR to add the WILLNEED fadvise flag helps.
>
>--
>
>Jason Dillaman


* Re: reads while 100% write
  2016-03-30 21:49               ` Evgeniy Firsov
@ 2016-03-30 21:59                 ` Josh Durgin
  0 siblings, 0 replies; 16+ messages in thread
From: Josh Durgin @ 2016-03-30 21:59 UTC (permalink / raw)
  To: Evgeniy Firsov, Jason Dillaman; +Cc: Sage Weil, ceph-devel@vger.kernel.org



On 03/30/2016 02:49 PM, Evgeniy Firsov wrote:
> OK, I will use rbd default features = 1 for now.
> Thank you for the help.

When you do start testing with object-map, keep in mind it's the writes
to empty objects that have the overhead. If you want to test 
steady-state, you may want to pre-fill the image.
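
To put rough numbers on the object map in this thread's setup, here is a small Python sketch. It assumes the object map stores 2 bits per object (an assumption, though it matches Jason's 24,576,000 figure for a ~6 MB map) and that "1.5T" and "150G" mean 1500 GiB and 150 GiB with 64 KiB objects. The function name is hypothetical.

```python
KIB = 1024
GIB = 1024 ** 3
BITS_PER_OBJECT = 2  # assumption: 2 bits of state per object


def object_map_estimate(image_bytes, object_bytes):
    """Return (object count, approximate object map payload in bytes)."""
    objects = -(-image_bytes // object_bytes)      # ceiling division
    payload = -(-objects * BITS_PER_OBJECT // 8)   # bits -> bytes, rounded up
    return objects, payload


# Image sizes mentioned in the thread, both with 64 KiB objects.
for size_gib in (1500, 150):
    objects, payload = object_map_estimate(size_gib * GIB, 64 * KIB)
    print(f"{size_gib} GiB image: {objects} objects, ~{payload} bytes of object map")
```

Under those assumptions the payloads come out at 6,144,000 and 614,400 bytes, close to the 6150030 and 615030 byte sizes in the logs; the small remainder is presumably header and checksum data.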

Josh

> On 3/30/16, 1:47 PM, "Jason Dillaman" <dillaman@redhat.com> wrote:
>
>> Correct, the change for the default RBD features actually merged on March
>> 1 as well (a7470c8), albeit a few hours after the commit you last tested
>> against (c1e41af).  You can revert to pre-Jewel RBD features on an
>> existing image by running the following:
>>
>> # rbd feature disable <image name> exclusive-lock,object-map,fast-diff,deep-flatten
>>
>> Hopefully the new PR to add the WILLNEED fadvise flag helps.
>>
>> --
>>
>> Jason Dillaman


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: reads while 100% write
@ 2016-03-30 20:02 Sage Weil
  2016-03-30 20:29 ` Jason Dillaman
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2016-03-30 20:02 UTC (permalink / raw)
  To: Jason Dillaman; +Cc: Evgeniy Firsov, ceph-devel

On Wed, 30 Mar 2016, Jason Dillaman wrote:
> This IO is being performed within an OSD class method.  I can add a new 
> cls_cxx_read2 method to accept cache hints and update the associated 
> object map methods.  Would this apply to writes as well?

Yeah, we'll want to hint them both.

s

> 
> -- 
> 
> Jason Dillaman 
> 
> 
> ----- Original Message -----
> > From: "Sage Weil" <sage@newdream.net>
> > To: "Jason Dillaman" <dillaman@redhat.com>
> > Cc: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>, ceph-devel@vger.kernel.org
> > Sent: Wednesday, March 30, 2016 3:55:14 PM
> > Subject: Re: reads while 100% write
> > 
> > On Wed, 30 Mar 2016, Jason Dillaman wrote:
> > > Are you using the RBD default of 4MB object sizes or are you using
> > > something much smaller like 64KB?  An object map of that size should be
> > > tracking up to 24,576,000 objects.  When you ran your test before, did
> > > you have the RBD object map disabled?  This definitely seems to be a use
> > > case where the lack of a cache in front of BlueStore is hurting small
> > > IO.
> > 
> > Using the rados cache hint WILLNEED is probably appropriate here..
> > 
> > sage
> > 
> > > 
> > > --
> > > 
> > > Jason Dillaman
> > > 
> > > 
> > > ----- Original Message -----
> > > > From: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>
> > > > To: "Jason Dillaman" <dillaman@redhat.com>
> > > > Cc: "Sage Weil" <sage@newdream.net>, ceph-devel@vger.kernel.org
> > > > Sent: Wednesday, March 30, 2016 3:00:47 PM
> > > > Subject: Re: reads while 100% write
> > > > 
> > > > 1.5T in that run.
> > > > With 150G the behavior is the same, except that it says
> > > > "_do_read 0~18 size 615030" instead of 6M.
> > > > 
> > > > Also, when the random 4k write starts there are more reads than writes:
> > > > 
> > > > Device:     rrqm/s    wrqm/s       r/s       w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz     await   r_await   w_await     svctm     %util
> > > > sdd           0.00   1887.00      0.00    344.00      0.00   8924.00     51.88      0.36      1.06      0.00      1.06      0.91     31.20
> > > > sde          30.00      0.00     30.00    957.00  18120.00   3828.00     44.47      0.25      0.26      3.87      0.14      0.17     16.40
> > > > 
> > > > Logs: http://pastebin.com/gGzfR5ez
> > > > 
> > > > 
> > > > On 3/30/16, 11:37 AM, "Jason Dillaman" <dillaman@redhat.com> wrote:
> > > > 
> > > > >How large is your RBD image?  100 terabytes?
> > > > >
> > > > >--
> > > > >
> > > > >Jason Dillaman
> > > > >
> > > > >
> > > > >----- Original Message -----
> > > > >> From: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>
> > > > >> To: "Sage Weil" <sage@newdream.net>
> > > > >> Cc: ceph-devel@vger.kernel.org
> > > > >> Sent: Wednesday, March 30, 2016 2:14:12 PM
> > > > >> Subject: Re: reads while 100% write
> > > > >>
> > > > >> These are suspicious lines:
> > > > >>
> > > > >> 2016-03-30 10:54:23.142205 7f2e933ff700 10 bluestore(src/dev/osd0)
> > > > >> read
> > > > >> 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 6144018~6012
> > > > >> =
> > > > >> 6012
> > > > >> 2016-03-30 10:54:23.142252 7f2e933ff700 15 bluestore(src/dev/osd0)
> > > > >> read
> > > > >> 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
> > > > >> 2016-03-30 10:54:23.142260 7f2e933ff700 20 bluestore(src/dev/osd0)
> > > > >> _do_read 8210~4096 size 6150030
> > > > >> 2016-03-30 10:54:23.142267 7f2e933ff700  5 bdev(src/dev/osd0/block)
> > > > >> read
> > > > >> 8003854336~8192
> > > > >> 2016-03-30 10:54:23.142609 7f2e933ff700 10 bluestore(src/dev/osd0)
> > > > >> read
> > > > >> 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096 =
> > > > >>4096
> > > > >> 2016-03-30 10:54:23.142882 7f2e933ff700 15 bluestore(src/dev/osd0)
> > > > >>_write
> > > > >> 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
> > > > >> 2016-03-30 10:54:23.142888 7f2e933ff700 20 bluestore(src/dev/osd0)
> > > > >> _do_write #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096 -
> > > > >>have
> > > > >> 6150030 bytes in 1 extents
> > > > >>
> > > > >> More logs here: http://pastebin.com/74WLzFYw
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 3/30/16, 4:19 AM, "Sage Weil" <sage@newdream.net> wrote:
> > > > >>
> > > > >> >On Wed, 30 Mar 2016, Evgeniy Firsov wrote:
> > > > > >> >> After pulling the master branch on Friday I started seeing odd fio
> > > > > >> >> behavior: I see a lot of reads while writing, and very low performance
> > > > > >> >> whether the workload is read or write.
> > > > >> >>
> > > > >> >> Output from sequential 1M write:
> > > > > >> >> Device:     rrqm/s    wrqm/s       r/s       w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz     await   r_await   w_await     svctm     %util
> > > > > >> >> sdd           0.00    409.00      0.00    364.00      0.00   3092.00     16.99      0.28      0.78      0.00      0.78      0.76     27.60
> > > > > >> >> sde           0.00    242.00    365.00    363.00   2436.00   9680.00     33.29      0.18      0.24      0.42      0.07      0.23     16.80
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >> block.db -> /dev/sdd
> > > > >> >> block -> /dev/sde
> > > > >> >>
> > > > >> >> health HEALTH_OK
> > > > >> >> monmap e1: 1 mons at {a=127.0.0.1:6789/0}
> > > > >> >>        election epoch 3, quorum 0 a
> > > > >> >> osdmap e7: 1 osds: 1 up, 1 in
> > > > >> >>        flags sortbitwise
> > > > >> >> pgmap v24: 64 pgs, 1 pools, 577 MB data, 9152 objects
> > > > >> >>        8210 MB used, 178 GB / 186 GB avail
> > > > >> >>              64 active+clean
> > > > >> >> client io 1550 kB/s rd, 9559 kB/s wr, 645 op/s rd, 387 op/s wr
> > > > >> >>
> > > > >> >>
> > > > > >> >> On an earlier revision (c1e41af), everything looks as expected:
> > > > >> >>
> > > > > >> >> Device:     rrqm/s    wrqm/s       r/s       w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz     await   r_await   w_await     svctm     %util
> > > > > >> >> sdd           0.00   4910.00      0.00    680.00      0.00  22416.00     65.93      1.05      1.55      0.00      1.55      1.18     80.00
> > > > > >> >> sde           0.00      0.00      0.00   3418.00      0.00 217612.00    127.33     63.78     18.18      0.00     18.18      0.25     86.40
> > > > >> >>
> > > > > >> >> Another observation, which may be related to the issue, is that the
> > > > > >> >> CPU load is imbalanced. A single "tp_osd_tp" thread is 100% busy
> > > > > >> >> while the rest are idle. It looks like all the load goes to a single
> > > > > >> >> thread pool shard; earlier the CPU load was well balanced.
> > > > >> >
> > > > > >> >Hmm.  Can you capture a log with debug bluestore = 20 and debug bdev = 20?
> > > > >> >
> > > > >> >Thanks!
> > > > >> >sage
> > > > >> >
> > > > >> >
> > > > >> >>
> > > > >> >>
> > > > >> >> ‹
> > > > >> >> Evgeniy
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > 
> > >
> 
> 
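A note on the log excerpt quoted above: the 4096-byte read at logical offset 8210 reaches the block device as an 8192-byte read (8003854336~8192). That is what one would expect if reads are rounded outward to block boundaries, because the range 8210~4096 straddles two 4096-byte blocks. The sketch below only reproduces that arithmetic. The 4096-byte block size and the helper name align_extent are assumptions made for illustration; they are not taken from the BlueStore source.

```python
BLOCK_SIZE = 4096  # assumed device block size; not stated in the thread


def align_extent(offset, length, block_size=BLOCK_SIZE):
    """Round the extent offset~length outward to block boundaries.

    Hypothetical helper for illustration only; returns (start, length).
    """
    start = (offset // block_size) * block_size
    # Ceiling division to find the first block boundary at or after the end.
    end = -(-(offset + length) // block_size) * block_size
    return start, end - start


start, length = align_extent(8210, 4096)
print(f"{start}~{length}")  # 8192~8192
```

Under these assumptions each unaligned 4 KB update of the object map costs two blocks of device reads before the write, which is consistent with the extra read traffic on sde in the iostat output.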

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: reads while 100% write
  2016-03-30 20:02 Sage Weil
@ 2016-03-30 20:29 ` Jason Dillaman
  2016-03-30 22:32   ` Sage Weil
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Dillaman @ 2016-03-30 20:29 UTC (permalink / raw)
  To: Sage Weil; +Cc: Evgeniy Firsov, ceph-devel

Opened PR 8380 [1] to pass the WILLNEED flag for object map updates.

[1] https://github.com/ceph/ceph/pull/8380

-- 

Jason Dillaman 
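For context, the 24,576,000-object estimate raised earlier in the thread can be reproduced with a little arithmetic. The sketch below assumes 2 state bits per object in the object map and a payload of 6,144,000 bytes; the payload figure is inferred from the logged offsets (the 6144018~6012 read, minus what looks like an 18-byte header) and is not stated by anyone in the thread. The 64 KiB object size is likewise only the value suggested in the question, and the function names are hypothetical.

```python
BITS_PER_OBJECT = 2  # assumption: the object map stores 2 state bits per object


def objects_tracked(map_payload_bytes):
    """Number of objects a bitmap payload of this size can describe."""
    return map_payload_bytes * 8 // BITS_PER_OBJECT


def image_size_gib(num_objects, object_size_bytes):
    """Image size in GiB implied by an object count and object size."""
    return num_objects * object_size_bytes / 2**30


# Payload size inferred from the log offsets (6144018 minus an 18-byte header).
n = objects_tracked(6_144_000)
print(n)                             # 24576000
print(image_size_gib(n, 64 * 1024))  # 1500.0
```

With these assumptions the numbers line up with a roughly 1.5T image built from 64 KiB objects, which would explain why the object map is several megabytes and is touched on nearly every write.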


----- Original Message -----
> From: "Sage Weil" <sage@newdream.net>
> To: "Jason Dillaman" <dillaman@redhat.com>
> Cc: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>, ceph-devel@vger.kernel.org
> Sent: Wednesday, March 30, 2016 4:02:16 PM
> Subject: Re: reads while 100% write
> 
> On Wed, 30 Mar 2016, Jason Dillaman wrote:
> > This IO is being performed within an OSD class method.  I can add a new
> > cls_cxx_read2 method to accept cache hints and update the associated
> > object map methods.  Would this apply to writes as well?
> 
> Yeah, we'll want to hint them both.
> 
> s
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: reads while 100% write
  2016-03-30 20:29 ` Jason Dillaman
@ 2016-03-30 22:32   ` Sage Weil
  2016-03-31  3:43     ` Evgeniy Firsov
  0 siblings, 1 reply; 16+ messages in thread
From: Sage Weil @ 2016-03-30 22:32 UTC (permalink / raw)
  To: Jason Dillaman; +Cc: Evgeniy Firsov, ceph-devel

Evgeniy,

Do you mind repeating your test with this code applied?

Thanks!
sage


On Wed, 30 Mar 2016, Jason Dillaman wrote:

> Opened PR 8380 [1] to pass the WILLNEED flag for object map updates.
> 
> [1] https://github.com/ceph/ceph/pull/8380
> 
> -- 
> 
> Jason Dillaman 
> 
> 
> ----- Original Message -----
> > From: "Sage Weil" <sage@newdream.net>
> > To: "Jason Dillaman" <dillaman@redhat.com>
> > Cc: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>, ceph-devel@vger.kernel.org
> > Sent: Wednesday, March 30, 2016 4:02:16 PM
> > Subject: Re: reads while 100% write
> > 
> > On Wed, 30 Mar 2016, Jason Dillaman wrote:
> > > This IO is being performed within an OSD class method.  I can add a new
> > > cls_cxx_read2 method to accept cache hints and update the associated
> > > object map methods.  Would this apply to writes as well?
> > 
> > Yeah, we'll want to hint them both.
> > 
> > s
> > 
> > > 
> > > --
> > > 
> > > Jason Dillaman
> > > 
> > > 
> > > ----- Original Message -----
> > > > From: "Sage Weil" <sage@newdream.net>
> > > > To: "Jason Dillaman" <dillaman@redhat.com>
> > > > Cc: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>,
> > > > ceph-devel@vger.kernel.org
> > > > Sent: Wednesday, March 30, 2016 3:55:14 PM
> > > > Subject: Re: reads while 100% write
> > > > 
> > > > On Wed, 30 Mar 2016, Jason Dillaman wrote:
> > > > > Are you using the RBD default of 4MB object sizes or are you using
> > > > > something much smaller like 64KB?  An object map of that size should be
> > > > > tracking up to 24,576,000 objects.  When you ran your test before, did
> > > > > you have the RBD object map disabled?  This definitely seems to be a
> > > > > use
> > > > > case where the lack of a cache in front of BlueStore is hurting small
> > > > > IO.
> > > > 
> > > > Using the rados cache hint WILLNEED is probably appropriate here..
> > > > 
> > > > sage
> > > > 
> > > > > 
> > > > > --
> > > > > 
> > > > > Jason Dillaman
> > > > > 
> > > > > 
> > > > > ----- Original Message -----
> > > > > > From: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>
> > > > > > To: "Jason Dillaman" <dillaman@redhat.com>
> > > > > > Cc: "Sage Weil" <sage@newdream.net>, ceph-devel@vger.kernel.org
> > > > > > Sent: Wednesday, March 30, 2016 3:00:47 PM
> > > > > > Subject: Re: reads while 100% write
> > > > > > 
> > > > > > 1.5T in that run.
> > > > > > With 150G behavior is the same. Except it says "_do_read 0~18 size
> > > > > > 615030”
> > > > > > instead of 6M.
> > > > > > 
> > > > > > Also when random 4k write starts there are more reads then writes:
> > > > > > 
> > > > > > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
> > > > > > avgrq-sz
> > > > > > avgqu-sz   await r_await w_await  svctm  %util
> > > > > > 
> > > > > > sdd               0.00  1887.00    0.00  344.00     0.00  8924.00
> > > > > > 51.88
> > > > > >     0.36    1.06    0.00    1.06   0.91  31.20
> > > > > > sde              30.00     0.00   30.00  957.00 18120.00  3828.00
> > > > > > 44.47
> > > > > >     0.25    0.26    3.87    0.14   0.17  16.40
> > > > > > 
> > > > > > Logs: http://pastebin.com/gGzfR5ez
> > > > > > 
> > > > > > 
> > > > > > On 3/30/16, 11:37 AM, "Jason Dillaman" <dillaman@redhat.com> wrote:
> > > > > > 
> > > > > > >How large is your RBD image?  100 terabytes?
> > > > > > >
> > > > > > >--
> > > > > > >
> > > > > > >Jason Dillaman
> > > > > > >
> > > > > > >
> > > > > > >----- Original Message -----
> > > > > > >> From: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>
> > > > > > >> To: "Sage Weil" <sage@newdream.net>
> > > > > > >> Cc: ceph-devel@vger.kernel.org
> > > > > > >> Sent: Wednesday, March 30, 2016 2:14:12 PM
> > > > > > >> Subject: Re: reads while 100% write
> > > > > > >>
> > > > > > >> These are suspicious lines:
> > > > > > >>
> > > > > > >> 2016-03-30 10:54:23.142205 7f2e933ff700 10 bluestore(src/dev/osd0)
> > > > > > >> read
> > > > > > >> 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head#
> > > > > > >> 6144018~6012
> > > > > > >> =
> > > > > > >> 6012
> > > > > > >> 2016-03-30 10:54:23.142252 7f2e933ff700 15 bluestore(src/dev/osd0)
> > > > > > >> read
> > > > > > >> 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
> > > > > > >> 2016-03-30 10:54:23.142260 7f2e933ff700 20 bluestore(src/dev/osd0)
> > > > > > >> _do_read 8210~4096 size 6150030
> > > > > > >> 2016-03-30 10:54:23.142267 7f2e933ff700  5
> > > > > > >> bdev(src/dev/osd0/block)
> > > > > > >> read
> > > > > > >> 8003854336~8192
> > > > > > >> 2016-03-30 10:54:23.142609 7f2e933ff700 10 bluestore(src/dev/osd0)
> > > > > > >> read
> > > > > > >> 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
> > > > > > >> =
> > > > > > >>4096
> > > > > > >> 2016-03-30 10:54:23.142882 7f2e933ff700 15 bluestore(src/dev/osd0)
> > > > > > >>_write
> > > > > > >> 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
> > > > > > >> 2016-03-30 10:54:23.142888 7f2e933ff700 20 bluestore(src/dev/osd0)
> > > > > > >> _do_write #0:b06b5e8e:::rbd_object_map.10046b8b4567:head#
> > > > > > >> 8210~4096 -
> > > > > > >>have
> > > > > > >> 6150030 bytes in 1 extents
> > > > > > >>
> > > > > > >> More logs here: http://pastebin.com/74WLzFYw
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> On 3/30/16, 4:19 AM, "Sage Weil" <sage@newdream.net> wrote:
> > > > > > >>
> > > > > > >> >On Wed, 30 Mar 2016, Evgeniy Firsov wrote:
> > > > > > >> >> After pulling master branch on Friday I start seeing odd fio
> > > > > > >>behavior, I
> > > > > > >> >> see a lot of reads while writing and very low performance no
> > > > > > >> >> matter
> > > > > > >> >> whether it read or write workload.
> > > > > > >> >>
> > > > > > >> >> Output from sequential 1M write:
> > > > > > >> >> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s
> > > > > > >> >> wkB/s
> > > > > > >> >>avgrq-sz
> > > > > > >> >> avgqu-sz   await r_await w_await  svctm  %util
> > > > > > >> >>
> > > > > > >> >> sdd               0.00   409.00    0.00  364.00     0.00
> > > > > > >> >> 3092.00
> > > > > > >> >>16.99
> > > > > > >> >>     0.28    0.78    0.00    0.78   0.76  27.60
> > > > > > >> >> sde               0.00   242.00  365.00  363.00  2436.00
> > > > > > >> >> 9680.00
> > > > > > >> >>33.29
> > > > > > >> >>     0.18    0.24    0.42    0.07   0.23  16.80
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >> block.db -> /dev/sdd
> > > > > > >> >> block -> /dev/sde
> > > > > > >> >>
> > > > > > >> >> health HEALTH_OK
> > > > > > >> >> monmap e1: 1 mons at {a=127.0.0.1:6789/0}
> > > > > > >> >>        election epoch 3, quorum 0 a
> > > > > > >> >> osdmap e7: 1 osds: 1 up, 1 in
> > > > > > >> >>        flags sortbitwise
> > > > > > >> >> pgmap v24: 64 pgs, 1 pools, 577 MB data, 9152 objects
> > > > > > >> >>        8210 MB used, 178 GB / 186 GB avail
> > > > > > >> >>              64 active+clean
> > > > > > >> >> client io 1550 kB/s rd, 9559 kB/s wr, 645 op/s rd, 387 op/s wr
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >> While on earlier revision(c1e41af) everything looks as
> > > > > > >> >> expected:
> > > > > > >> >>
> > > > > > >> >> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s
> > > > > > >> >> wkB/s
> > > > > > >> >>avgrq-sz
> > > > > > >> >> avgqu-sz   await r_await w_await  svctm  %util
> > > > > > >> >> sdd               0.00  4910.00    0.00  680.00     0.00
> > > > > > >> >> 22416.00
> > > > > > >> >>65.93
> > > > > > >> >>     1.05    1.55    0.00    1.55   1.18  80.00
> > > > > > >> >> sde               0.00     0.00    0.00 3418.00     0.00
> > > > > > >> >> 217612.00
> > > > > > >> >> 127.33    63.78   18.18    0.00   18.18   0.25  86.40
> > > > > > >> >>
> > > > > > >> >> Other observation, may be related to the issue, is that CPU
> > > > > > >> >> load is
> > > > > > >> >> imbalanced. Single ³tp_osd_tp² thread is 100% busy, while the
> > > > > > >> >> rest
> > > > > > >> >> is
> > > > > > >> >>idle.
> > > > > > >> >> Looks like all load goes to single thread pool shard, earlier
> > > > > > >> >> CPU
> > > > > > >> >> was
> > > > > > >> >>well
> > > > > > >> >> balanced.
> > > > > > >> >
> > > > > > >> >Hmm.  Can you capture a log with debug bluestore = 20 and debug
> > > > > > >> >bdev
> > > > > > >> >=
> > > > > > >>20?
> > > > > > >> >
> > > > > > >> >Thanks!
> > > > > > >> >sage
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >> ‹
> > > > > > >> >> Evgeniy
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >>
> > > > > 
> > > > >
> > > 
> > >
> 
> 


* Re: reads while 100% write
  2016-03-30 22:32   ` Sage Weil
@ 2016-03-31  3:43     ` Evgeniy Firsov
  0 siblings, 0 replies; 16+ messages in thread
From: Evgeniy Firsov @ 2016-03-31  3:43 UTC (permalink / raw)
  To: Sage Weil, Jason Dillaman; +Cc: ceph-devel@vger.kernel.org

With this change the reads are gone, but performance is still low.
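
For reference, the reads in question show up in the log excerpt quoted below: a 4096-byte object map access at offset 8210 is followed by an 8192-byte read from the block device. A minimal Python sketch of that arithmetic, assuming a 4096-byte device block size (the thread does not state it) and using a hypothetical helper name; it only illustrates why an unaligned range presumably touches two blocks, not how BlueStore itself computes the range.

```python
def aligned_span(offset, length, block=4096):
    """Return (start, length) of the block-aligned range covering offset~length.

    block=4096 is an assumption; the thread does not state the device block size.
    """
    start = (offset // block) * block            # round the start down to a block boundary
    end = -(-(offset + length) // block) * block  # round the end up to a block boundary
    return start, end - start


# The access from the quoted log: 8210~4096 inside the object map object.
print(aligned_span(8210, 4096))
```

Under that assumption the range 8210~4096 expands to 8192~8192, which agrees with the 8192-byte length of the bdev read in the log.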

On 3/30/16, 3:32 PM, "Sage Weil" <sage@newdream.net> wrote:

>Evgeniy,
>
>Do you mind repeating your test with this code applied?
>
>Thanks!
>sage
>
>
>On Wed, 30 Mar 2016, Jason Dillaman wrote:
>
>> Opened PR 8380 [1] to pass the WILLNEED flag for object map updates.
>>
>> [1] https://github.com/ceph/ceph/pull/8380
>>
>> --
>>
>> Jason Dillaman
>>
>>
>> ----- Original Message -----
>> > From: "Sage Weil" <sage@newdream.net>
>> > To: "Jason Dillaman" <dillaman@redhat.com>
>> > Cc: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>, ceph-devel@vger.kernel.org
>> > Sent: Wednesday, March 30, 2016 4:02:16 PM
>> > Subject: Re: reads while 100% write
>> >
>> > On Wed, 30 Mar 2016, Jason Dillaman wrote:
>> > > This IO is being performed within an OSD class method.  I can add a new
>> > > cls_cxx_read2 method to accept cache hints and update the associated
>> > > object map methods.  Would this apply to writes as well?
>> >
>> > Yeah, we'll want to hint them both.
>> >
>> > s
>> >
>> > >
>> > > --
>> > >
>> > > Jason Dillaman
>> > >
>> > >
>> > > ----- Original Message -----
>> > > > From: "Sage Weil" <sage@newdream.net>
>> > > > To: "Jason Dillaman" <dillaman@redhat.com>
>> > > > Cc: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>,
>> > > > ceph-devel@vger.kernel.org
>> > > > Sent: Wednesday, March 30, 2016 3:55:14 PM
>> > > > Subject: Re: reads while 100% write
>> > > >
>> > > > On Wed, 30 Mar 2016, Jason Dillaman wrote:
>> > > > > Are you using the RBD default of 4MB object sizes or are you using
>> > > > > something much smaller like 64KB?  An object map of that size should be
>> > > > > tracking up to 24,576,000 objects.  When you ran your test before, did
>> > > > > you have the RBD object map disabled?  This definitely seems to be a use
>> > > > > case where the lack of a cache in front of BlueStore is hurting small
>> > > > > IO.
>> > > >
>> > > > Using the rados cache hint WILLNEED is probably appropriate here..
>> > > >
>> > > > sage
>> > > >
>> > > > >
>> > > > > --
>> > > > >
>> > > > > Jason Dillaman
>> > > > >
>> > > > >
>> > > > > ----- Original Message -----
>> > > > > > From: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>
>> > > > > > To: "Jason Dillaman" <dillaman@redhat.com>
>> > > > > > Cc: "Sage Weil" <sage@newdream.net>, ceph-devel@vger.kernel.org
>> > > > > > Sent: Wednesday, March 30, 2016 3:00:47 PM
>> > > > > > Subject: Re: reads while 100% write
>> > > > > >
>> > > > > > 1.5T in that run.
>> > > > > > With 150G the behavior is the same, except it says "_do_read 0~18 size
>> > > > > > 615030" instead of 6M.
>> > > > > >
>> > > > > > Also, when the random 4k write starts there are more reads than writes:
>> > > > > >
>> > > > > > Device:    rrqm/s   wrqm/s      r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> > > > > > sdd          0.00  1887.00     0.00   344.00      0.00   8924.00    51.88     0.36    1.06    0.00    1.06   0.91  31.20
>> > > > > > sde         30.00     0.00    30.00   957.00  18120.00   3828.00    44.47     0.25    0.26    3.87    0.14   0.17  16.40
>> > > > > >
>> > > > > > Logs: http://pastebin.com/gGzfR5ez
>> > > > > >
>> > > > > >
>> > > > > > On 3/30/16, 11:37 AM, "Jason Dillaman" <dillaman@redhat.com> wrote:
>> > > > > >
>> > > > > > >How large is your RBD image?  100 terabytes?
>> > > > > > >
>> > > > > > >--
>> > > > > > >
>> > > > > > >Jason Dillaman
>> > > > > > >
>> > > > > > >
>> > > > > > >----- Original Message -----
>> > > > > > >> From: "Evgeniy Firsov" <Evgeniy.Firsov@sandisk.com>
>> > > > > > >> To: "Sage Weil" <sage@newdream.net>
>> > > > > > >> Cc: ceph-devel@vger.kernel.org
>> > > > > > >> Sent: Wednesday, March 30, 2016 2:14:12 PM
>> > > > > > >> Subject: Re: reads while 100% write
>> > > > > > >>
>> > > > > > >> These are suspicious lines:
>> > > > > > >>
>> > > > > > >> 2016-03-30 10:54:23.142205 7f2e933ff700 10 bluestore(src/dev/osd0) read 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 6144018~6012 = 6012
>> > > > > > >> 2016-03-30 10:54:23.142252 7f2e933ff700 15 bluestore(src/dev/osd0) read 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
>> > > > > > >> 2016-03-30 10:54:23.142260 7f2e933ff700 20 bluestore(src/dev/osd0) _do_read 8210~4096 size 6150030
>> > > > > > >> 2016-03-30 10:54:23.142267 7f2e933ff700  5 bdev(src/dev/osd0/block) read 8003854336~8192
>> > > > > > >> 2016-03-30 10:54:23.142609 7f2e933ff700 10 bluestore(src/dev/osd0) read 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096 = 4096
>> > > > > > >> 2016-03-30 10:54:23.142882 7f2e933ff700 15 bluestore(src/dev/osd0) _write 0.d_head #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096
>> > > > > > >> 2016-03-30 10:54:23.142888 7f2e933ff700 20 bluestore(src/dev/osd0) _do_write #0:b06b5e8e:::rbd_object_map.10046b8b4567:head# 8210~4096 - have 6150030 bytes in 1 extents
>> > > > > > >>
>> > > > > > >> More logs here: http://pastebin.com/74WLzFYw
>> > > > > > >>
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> On 3/30/16, 4:19 AM, "Sage Weil" <sage@newdream.net> wrote:
>> > > > > > >>
>> > > > > > >> >On Wed, 30 Mar 2016, Evgeniy Firsov wrote:
>> > > > > > >> >> After pulling master branch on Friday I start seeing odd fio behavior, I
>> > > > > > >> >> see a lot of reads while writing and very low performance no matter
>> > > > > > >> >> whether it read or write workload.
>> > > > > > >> >>
>> > > > > > >> >> Output from sequential 1M write:
>> > > > > > >> >> Device:    rrqm/s   wrqm/s      r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> > > > > > >> >> sdd          0.00   409.00     0.00   364.00      0.00   3092.00    16.99     0.28    0.78    0.00    0.78   0.76  27.60
>> > > > > > >> >> sde          0.00   242.00   365.00   363.00   2436.00   9680.00    33.29     0.18    0.24    0.42    0.07   0.23  16.80
>> > > > > > >> >>
>> > > > > > >> >>
>> > > > > > >> >>
>> > > > > > >> >> block.db -> /dev/sdd
>> > > > > > >> >> block -> /dev/sde
>> > > > > > >> >>
>> > > > > > >> >> health HEALTH_OK
>> > > > > > >> >> monmap e1: 1 mons at {a=127.0.0.1:6789/0}
>> > > > > > >> >>        election epoch 3, quorum 0 a
>> > > > > > >> >> osdmap e7: 1 osds: 1 up, 1 in
>> > > > > > >> >>        flags sortbitwise
>> > > > > > >> >> pgmap v24: 64 pgs, 1 pools, 577 MB data, 9152 objects
>> > > > > > >> >>        8210 MB used, 178 GB / 186 GB avail
>> > > > > > >> >>              64 active+clean
>> > > > > > >> >> client io 1550 kB/s rd, 9559 kB/s wr, 645 op/s rd, 387 op/s wr
>> > > > > > >> >>
>> > > > > > >> >>
>> > > > > > >> >> While on earlier revision(c1e41af) everything looks as expected:
>> > > > > > >> >>
>> > > > > > >> >> Device:    rrqm/s   wrqm/s      r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> > > > > > >> >> sdd          0.00  4910.00     0.00   680.00      0.00  22416.00    65.93     1.05    1.55    0.00    1.55   1.18  80.00
>> > > > > > >> >> sde          0.00     0.00     0.00  3418.00      0.00 217612.00   127.33    63.78   18.18    0.00   18.18   0.25  86.40
>> > > > > > >> >>
>> > > > > > >> >> Other observation, may be related to the issue, is that CPU load is
>> > > > > > >> >> imbalanced. Single "tp_osd_tp" thread is 100% busy, while the rest is idle.
>> > > > > > >> >> Looks like all load goes to single thread pool shard, earlier CPU was well
>> > > > > > >> >> balanced.
>> > > > > > >> >
>> > > > > > >> >Hmm.  Can you capture a log with debug bluestore = 20 and debug bdev = 20?
>> > > > > > >> >
>> > > > > > >> >Thanks!
>> > > > > > >> >sage
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> >>
>> > > > > > >> >>
>> > > > > > >> >> --
>> > > > > > >> >> Evgeniy
>> > > > > > >> >>
>> > > > > > >> >>
>> > > > > > >> >>
>> > > > >
>> > > > >
>> > >
>> > >
>>
>>
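
As a cross-check on the object map sizes quoted above (6150030 bytes for the 1.5T image, 615030 bytes for the 150G one, and "up to 24,576,000 objects"), here is a rough Python sketch. It rests on assumptions that the quoted text does not confirm: 2 bits of state per object, 64 KiB objects (the smaller size Jason asks about), and "1.5T" meaning 1500 GiB. The function name is hypothetical.

```python
GIB = 1 << 30
KIB = 1 << 10
BITS_PER_OBJECT = 2  # assumption: 2 bits of object map state per object


def object_map_payload(image_bytes, object_bytes):
    """Return (object_count, payload_bytes) for a packed 2-bit-per-object map."""
    objects = -(-image_bytes // object_bytes)        # ceiling division
    payload = -(-(objects * BITS_PER_OBJECT) // 8)   # bits to bytes, rounded up
    return objects, payload


# Image sizes and logged object sizes are taken from the thread;
# the 64 KiB object size is an assumption.
for label, image_bytes, logged_size in (
    ("1.5T", 1500 * GIB, 6150030),
    ("150G", 150 * GIB, 615030),
):
    objects, payload = object_map_payload(image_bytes, 64 * KIB)
    # Whatever is left over would be header/footer overhead around the bitmap.
    print(label, objects, payload, logged_size - payload)
```

Under these assumptions the 1.5T image gives 24,576,000 objects and a 6,144,000-byte payload, which lines up with the 6144018~6012 read in the log if the first 18 bytes are a header. With the 4MB default object size the map would be far smaller than the sizes logged.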



end of thread, other threads:[~2016-03-31  3:43 UTC | newest]

Thread overview: 16+ messages
2016-03-30  2:35 reads while 100% write Evgeniy Firsov
2016-03-30 11:19 ` Sage Weil
2016-03-30 18:14   ` Evgeniy Firsov
2016-03-30 18:37     ` Jason Dillaman
2016-03-30 19:00       ` Evgeniy Firsov
2016-03-30 19:10         ` Jason Dillaman
2016-03-30 19:55           ` Sage Weil
2016-03-30 19:59             ` Jason Dillaman
2016-03-30 20:39           ` Evgeniy Firsov
2016-03-30 20:47             ` Jason Dillaman
2016-03-30 21:49               ` Evgeniy Firsov
2016-03-30 21:59                 ` Josh Durgin
  -- strict thread matches above, loose matches on Subject: below --
2016-03-30 20:02 Sage Weil
2016-03-30 20:29 ` Jason Dillaman
2016-03-30 22:32   ` Sage Weil
2016-03-31  3:43     ` Evgeniy Firsov
