Request for Comments: Weighted Round Robin OP Queue

All of lore.kernel.org
 help / color / mirror / Atom feed

* Request for Comments: Weighted Round Robin OP Queue
@ 2015-11-04 16:54 Robert LeBlanc
  2015-11-04 19:49 ` Samuel Just
  0 siblings, 1 reply; 24+ messages in thread
From: Robert LeBlanc @ 2015-11-04 16:54 UTC (permalink / raw)
  To: ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I've got some rough code that changes out the token bucket queue in
PrioritizedQueue.h with a weighted round robin queue located at [1].
Even though there is some more optimizations that can be done, running
the fio job [2], I've seen about a ~20% performance increase on
spindles and ~6% performance increase on SSDs (my hosts are CPU bound
on SSD).

The idea of this queue is to try to be fair to all OPs relative to
their priority while at the same time reducing the overhead for each
OP (queue and dequeue) from O(n) to closer to O(1).

One issue that I'm having is that under certain workloads and usually
during recovery I get these asserts and need help pinpointing how to
resolve it.

 osd/PG.cc: In function 'void PG::add_log_entry(const pg_log_entry_t&,
ceph::bufferlist&)' thread 7f55d61fd700 time 2015-11-03
14:44:28.638112
osd/PG.cc: 2923: FAILED assert(e.version > info.last_update)
osd/PG.cc: In function 'void PG::add_log_entry(const pg_log_entry_t&,
ceph::bufferlist&)' thread 7f55d7a00700 time 2015-11-03
14:44:28.637053
osd/PG.cc: 2923: FAILED assert(e.version > info.last_update)
 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x76) [0xc1e3a6]
 2: ceph-osd() [0x7d5a7c]
 3: (PG::append_log(std::vector > const&, eversion_t, eversion_t,
ObjectStore::Transaction&, bool)+0x111) [0x7f7181]
 4: (ReplicatedPG::log_operation(std::vector > const&,
boost::optional&, eversion_t const&, eversion_t const&, bool,
ObjectStore::Transaction*)+0xad) [0x8bfc7d]
 5: (void ReplicatedBackend::sub_op_modify_impl(std::tr1::shared_ptr)+0x7b9)
[0xa5e119]
 6: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr)+0x4a) [0xa4950a]
 7: (ReplicatedBackend::handle_message(std::tr1::shared_ptr)+0x363) [0xa49923]
 8: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
ThreadPool::TPHandle&)+0x159) [0x847ae9]
 9: (OSD::dequeue_op(boost::intrusive_ptr, std::tr1::shared_ptr,
ThreadPool::TPHandle&)+0x3cf) [0x690cef]
 10: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x469) [0x691359]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89e)
[0xc0d8ae]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc0fa00]
 13: (()+0x80a4) [0x7f55f9edd0a4]
 14: (clone()+0x6d) [0x7f55f843904d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.

I think this means that the PG log to be appended is newer than what
is expected, but I'm not sure how to rectify it. Any pushes in the
right direction would be helpful.

It seems that this queue is helping with recover ops even when
osd_max_backfills=20 with max client ops, but I don't have good long
term data due to this issue. I think this has also impacted my SSD
testing as I lose one OSD during the test, reducing the performance
temporarily.

When looking through my code, please remember.
1. This may be the first time I wrote C++ code, or it has been long
enough it seems like it.
2. There is still some optimizations that I know can be done. But I'm
happy to have people share any optimization opportunities they see.
3. I'm trying to understand the reason for the assert and pointers how
to resolve it.
4. It seems like there are multiple threads of the queue keeping the
queues pretty small. How can I limit the queue to one thread so all
OPs have to be queued in one queue? I'd like to see the differences
with changing this.
5. I'd like any pointers to improving this code.

Thank you,
Robert LeBlanc

[1] https://github.com/ceph/ceph/compare/hammer...rldleblanc:wrr-queue
[2] [rbd-test]
#readwrite=write
#blocksize=4M
runtime=600
name=rbd-test
readwrite=randrw
bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
rwmixread=72
norandommap
#size=1T
#blocksize=4k
ioengine=rbd
rbdname=test5
pool=ssd-pool
clientname=admin
iodepth=8
numjobs=4
thread
group_reporting
time_based
#direct=1
ramp_time=60

- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWOjheCRDmVDuy+mK58QAAfxkQAJjgP4cjtHiFdtZgR2Zo
yMPeV1b+ZYoQr4XbyCqWAsRdgigdcesCnjxyTOWnK+nHZgxMOgtHn8rylltV
17NzleGKfQUDRe7jLHLOaLDMphODvW0BjJHV8uk5DzYVJhVOhT5oHtJTtRXY
JtMCIaGcwEPSP9IE+bkzX22fPEeNnkCHFAosmratD2WIeaNrOfV0DNOfAotO
FX2/w0NtiuNqr+KEH3MrPdHkENXLhG2A8wiLqJ7sN0LvclwGbO9eZ01sv5nV
bqqS8dQjd4oh31799vBroX73uMOb+ljeXNguz/4l4Tekn+F3m5puFHEX2o23
NroU1YHNcKFAOwppZ7pDrAn3ATzvOEsZ7574dJw5vPxquCgsF0T8/phsk71D
E1IOQC/EIqCw4wUnujwlEZXwlSXRLyqT5xUrSXo/qtM4HUz4PmWukxZxOmk/
Afewcbq/5ElSZQus1xmMdmtGocSGAvMmYthIbXP+3l2127bMK2ptacL6VMSf
uO+wYCLQZDnpjlx9DYt4CAEbEeuS4vCSzIkGishcuFNHGmM/gXXqYFybAATt
IbLRWZBrq4TyfJe9sIp6aNPbi/IHxSV4NVVX3q1P2j91UDKKVL6hu9Ln0HTY
UrFuDnH0yjvwBm4vJ0ksoWLIWTciLTTz68ZyOnOnr+uXGbkQEz1LzMQWZ+Cl
saYj
=R1Lm
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-04 16:54 Request for Comments: Weighted Round Robin OP Queue Robert LeBlanc
@ 2015-11-04 19:49 ` Samuel Just
  2015-11-05  3:00   ` Robert LeBlanc
  0 siblings, 1 reply; 24+ messages in thread
From: Samuel Just @ 2015-11-04 19:49 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-devel

I didn't look into it closely, but that almost certainly means that
your queue is reordering primary->replica replicated write messages.
-Sam

On Wed, Nov 4, 2015 at 8:54 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> I've got some rough code that changes out the token bucket queue in
> PrioritizedQueue.h with a weighted round robin queue located at [1].
> Even though there is some more optimizations that can be done, running
> the fio job [2], I've seen about a ~20% performance increase on
> spindles and ~6% performance increase on SSDs (my hosts are CPU bound
> on SSD).
>
> The idea of this queue is to try to be fair to all OPs relative to
> their priority while at the same time reducing the overhead for each
> OP (queue and dequeue) from O(n) to closer to O(1).
>
> One issue that I'm having is that under certain workloads and usually
> during recovery I get these asserts and need help pinpointing how to
> resolve it.
>
>  osd/PG.cc: In function 'void PG::add_log_entry(const pg_log_entry_t&,
> ceph::bufferlist&)' thread 7f55d61fd700 time 2015-11-03
> 14:44:28.638112
> osd/PG.cc: 2923: FAILED assert(e.version > info.last_update)
> osd/PG.cc: In function 'void PG::add_log_entry(const pg_log_entry_t&,
> ceph::bufferlist&)' thread 7f55d7a00700 time 2015-11-03
> 14:44:28.637053
> osd/PG.cc: 2923: FAILED assert(e.version > info.last_update)
>  ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x76) [0xc1e3a6]
>  2: ceph-osd() [0x7d5a7c]
>  3: (PG::append_log(std::vector > const&, eversion_t, eversion_t,
> ObjectStore::Transaction&, bool)+0x111) [0x7f7181]
>  4: (ReplicatedPG::log_operation(std::vector > const&,
> boost::optional&, eversion_t const&, eversion_t const&, bool,
> ObjectStore::Transaction*)+0xad) [0x8bfc7d]
>  5: (void ReplicatedBackend::sub_op_modify_impl(std::tr1::shared_ptr)+0x7b9)
> [0xa5e119]
>  6: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr)+0x4a) [0xa4950a]
>  7: (ReplicatedBackend::handle_message(std::tr1::shared_ptr)+0x363) [0xa49923]
>  8: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
> ThreadPool::TPHandle&)+0x159) [0x847ae9]
>  9: (OSD::dequeue_op(boost::intrusive_ptr, std::tr1::shared_ptr,
> ThreadPool::TPHandle&)+0x3cf) [0x690cef]
>  10: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x469) [0x691359]
>  11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89e)
> [0xc0d8ae]
>  12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc0fa00]
>  13: (()+0x80a4) [0x7f55f9edd0a4]
>  14: (clone()+0x6d) [0x7f55f843904d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
>
> I think this means that the PG log to be appended is newer than what
> is expected, but I'm not sure how to rectify it. Any pushes in the
> right direction would be helpful.
>
> It seems that this queue is helping with recover ops even when
> osd_max_backfills=20 with max client ops, but I don't have good long
> term data due to this issue. I think this has also impacted my SSD
> testing as I lose one OSD during the test, reducing the performance
> temporarily.
>
> When looking through my code, please remember.
> 1. This may be the first time I wrote C++ code, or it has been long
> enough it seems like it.
> 2. There is still some optimizations that I know can be done. But I'm
> happy to have people share any optimization opportunities they see.
> 3. I'm trying to understand the reason for the assert and pointers how
> to resolve it.
> 4. It seems like there are multiple threads of the queue keeping the
> queues pretty small. How can I limit the queue to one thread so all
> OPs have to be queued in one queue? I'd like to see the differences
> with changing this.
> 5. I'd like any pointers to improving this code.
>
> Thank you,
> Robert LeBlanc
>
> [1] https://github.com/ceph/ceph/compare/hammer...rldleblanc:wrr-queue
> [2] [rbd-test]
> #readwrite=write
> #blocksize=4M
> runtime=600
> name=rbd-test
> readwrite=randrw
> bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> rwmixread=72
> norandommap
> #size=1T
> #blocksize=4k
> ioengine=rbd
> rbdname=test5
> pool=ssd-pool
> clientname=admin
> iodepth=8
> numjobs=4
> thread
> group_reporting
> time_based
> #direct=1
> ramp_time=60
>
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWOjheCRDmVDuy+mK58QAAfxkQAJjgP4cjtHiFdtZgR2Zo
> yMPeV1b+ZYoQr4XbyCqWAsRdgigdcesCnjxyTOWnK+nHZgxMOgtHn8rylltV
> 17NzleGKfQUDRe7jLHLOaLDMphODvW0BjJHV8uk5DzYVJhVOhT5oHtJTtRXY
> JtMCIaGcwEPSP9IE+bkzX22fPEeNnkCHFAosmratD2WIeaNrOfV0DNOfAotO
> FX2/w0NtiuNqr+KEH3MrPdHkENXLhG2A8wiLqJ7sN0LvclwGbO9eZ01sv5nV
> bqqS8dQjd4oh31799vBroX73uMOb+ljeXNguz/4l4Tekn+F3m5puFHEX2o23
> NroU1YHNcKFAOwppZ7pDrAn3ATzvOEsZ7574dJw5vPxquCgsF0T8/phsk71D
> E1IOQC/EIqCw4wUnujwlEZXwlSXRLyqT5xUrSXo/qtM4HUz4PmWukxZxOmk/
> Afewcbq/5ElSZQus1xmMdmtGocSGAvMmYthIbXP+3l2127bMK2ptacL6VMSf
> uO+wYCLQZDnpjlx9DYt4CAEbEeuS4vCSzIkGishcuFNHGmM/gXXqYFybAATt
> IbLRWZBrq4TyfJe9sIp6aNPbi/IHxSV4NVVX3q1P2j91UDKKVL6hu9Ln0HTY
> UrFuDnH0yjvwBm4vJ0ksoWLIWTciLTTz68ZyOnOnr+uXGbkQEz1LzMQWZ+Cl
> saYj
> =R1Lm
> -----END PGP SIGNATURE-----
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-04 19:49 ` Samuel Just
@ 2015-11-05  3:00   ` Robert LeBlanc
  2015-11-05  3:20     ` Gregory Farnum
  0 siblings, 1 reply; 24+ messages in thread
From: Robert LeBlanc @ 2015-11-05  3:00 UTC (permalink / raw)
  To: Samuel Just; +Cc: ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Thanks for your help on IRC Samuel. I think I found where I made a
mistake. I'll do some more testing. So far with max_backfills=1 on
spindles, the impact of setting an OSD out and in on a saturated
cluster seems to be minimal. On my I/O graphs it is hard to tell where
the OSD was out and in recovering. If I/O becomes blocked, it seems
that they don't linger around long. All of the clients report getting
about the same amount of work done with little variance so no one
client is getting indefinitely blocked (or blocked for really long
times) causing the results between clients to be skewed like before.

So far this queue seems to be very positive. I'd hate to put a lot of
working getting this ready to merge if there is little interest in it
(a lot of things to do at work and some other things I'd like to track
down in the Ceph code as well). What are some of the next steps for
something like this, meaning a pretty significant change to core code?

Thank you to all who took time to help point me in the right direction.
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, Nov 4, 2015 at 12:49 PM, Samuel Just  wrote:
> I didn't look into it closely, but that almost certainly means that
> your queue is reordering primary->replica replicated write messages.
> -Sam
>

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWOsY1CRDmVDuy+mK58QAA9LIQALIUgbS4BuDS704HPOpA
XwvGxspelMCaBkLHLgiHU4T/Jc8JaXhgdRMwMiKeLI246Z7hRngSGlIDYc4+
nP4kWZIkwbJeTa/Z6bM6C3itFtJmQpkPvdjI+GiME5ZdYvFgCZQyDD71rqja
H14m0+JsEaIHQF0JZz6OyNxbyRWsM+M68nOvpAx8/fOGHBC/0VwPbLrOUP9O
3J3NvbhN9xlYJeivXSAyzxmHQDD8mO1c1AUTrHgnTViD2k3fmcH0mOHIJ+jn
ARZbeLN3hlXG0i9PHpnHzBVNSxsfb5VPxX970R3gvRWIt40QV/QL7q2SajWP
ofxgEpkaO48ANQSYDlqSNcM+w46TtgcJljtX0vbrHIW3Skyaz4UZQ/dzX4lX
a5Zzk01oFwXfMd10KgVbJf78qVYHy2r5aq46iFnrFLU43iy+Qve7Kex4XZFi
vPFFVea89Of838NqTxW21+3oJthrz1g7RKHghZAbXaj3WKchuEU+uVG4XTo1
0PU4a5ZYVTH6zYHpwJo2/89OzdkBe9S6s00+4JmfVWWEhb6+QwUjBQp1TJbB
TnMzSKfzgRyi/wHThv2XcZN12tttZMM2L4Ea3mHG+cxOTTZ1opv8/H2mprm8
7UuO4vk5K0c4IwPVmt9m5DTVhyn4hZ/QJmc+NARD3zc1u3qWFLkH2WaRMpBb
mRWA
=kgAl
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-05  3:00   ` Robert LeBlanc
@ 2015-11-05  3:20     ` Gregory Farnum
  2015-11-05 15:14       ` Robert LeBlanc
  0 siblings, 1 reply; 24+ messages in thread
From: Gregory Farnum @ 2015-11-05  3:20 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: Samuel Just, ceph-devel

On Wed, Nov 4, 2015 at 7:00 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Thanks for your help on IRC Samuel. I think I found where I made a
> mistake. I'll do some more testing. So far with max_backfills=1 on
> spindles, the impact of setting an OSD out and in on a saturated
> cluster seems to be minimal. On my I/O graphs it is hard to tell where
> the OSD was out and in recovering. If I/O becomes blocked, it seems
> that they don't linger around long. All of the clients report getting
> about the same amount of work done with little variance so no one
> client is getting indefinitely blocked (or blocked for really long
> times) causing the results between clients to be skewed like before.
>
> So far this queue seems to be very positive. I'd hate to put a lot of
> working getting this ready to merge if there is little interest in it
> (a lot of things to do at work and some other things I'd like to track
> down in the Ceph code as well). What are some of the next steps for
> something like this, meaning a pretty significant change to core code?

Well, step one is to convince people it's worthwhile. Your performance
information and anecdotal evidence of client impact is a pretty good
start. For it to get merged:
1) People will need to review it and verify it's not breaking anything
they can identify from code. Things are a bit constricted right now,
but this is pretty small and of high interest so I make no promises
for the core team but submitting a PR will be the way to start.
Getting positive buy-in from other contributors who are interested in
performance will also push it up the queue.
2) There will need to be a lot of testing on something like this.
Everything has to pass a run of the RADOS suite. Unfortunately this is
a bad month for that as the lab is getting physically shipped around
in a few weeks, so if you can afford to make it happen with the
teuthology-openstack stuff that will accelerate the timeline a lot (we
will still need to run it ourselves but once it's passed externally we
can put it in a lot more test runs we expect to pass, instead of in a
bucket with others that will all get blocked on any one failure).
3) For a new queuing system I suspect that rather than a direct merge
to default master, Sam will want to keep both in the code for a while
with a config value and run a lot of the nightlies on this one to
tease out any subtle races and bugs.
4) Eventually we become confident that it's in good shape and it
replaces the old queue.

Obviously those are the optimistic steps. ;)
-Greg

>
> Thank you to all who took time to help point me in the right direction.
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Nov 4, 2015 at 12:49 PM, Samuel Just  wrote:
>> I didn't look into it closely, but that almost certainly means that
>> your queue is reordering primary->replica replicated write messages.
>> -Sam
>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWOsY1CRDmVDuy+mK58QAA9LIQALIUgbS4BuDS704HPOpA
> XwvGxspelMCaBkLHLgiHU4T/Jc8JaXhgdRMwMiKeLI246Z7hRngSGlIDYc4+
> nP4kWZIkwbJeTa/Z6bM6C3itFtJmQpkPvdjI+GiME5ZdYvFgCZQyDD71rqja
> H14m0+JsEaIHQF0JZz6OyNxbyRWsM+M68nOvpAx8/fOGHBC/0VwPbLrOUP9O
> 3J3NvbhN9xlYJeivXSAyzxmHQDD8mO1c1AUTrHgnTViD2k3fmcH0mOHIJ+jn
> ARZbeLN3hlXG0i9PHpnHzBVNSxsfb5VPxX970R3gvRWIt40QV/QL7q2SajWP
> ofxgEpkaO48ANQSYDlqSNcM+w46TtgcJljtX0vbrHIW3Skyaz4UZQ/dzX4lX
> a5Zzk01oFwXfMd10KgVbJf78qVYHy2r5aq46iFnrFLU43iy+Qve7Kex4XZFi
> vPFFVea89Of838NqTxW21+3oJthrz1g7RKHghZAbXaj3WKchuEU+uVG4XTo1
> 0PU4a5ZYVTH6zYHpwJo2/89OzdkBe9S6s00+4JmfVWWEhb6+QwUjBQp1TJbB
> TnMzSKfzgRyi/wHThv2XcZN12tttZMM2L4Ea3mHG+cxOTTZ1opv8/H2mprm8
> 7UuO4vk5K0c4IwPVmt9m5DTVhyn4hZ/QJmc+NARD3zc1u3qWFLkH2WaRMpBb
> mRWA
> =kgAl
> -----END PGP SIGNATURE-----
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-05  3:20     ` Gregory Farnum
@ 2015-11-05 15:14       ` Robert LeBlanc
  2015-11-05 15:16         ` Mark Nelson
                           ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Robert LeBlanc @ 2015-11-05 15:14 UTC (permalink / raw)
  To: ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Thanks Gregory,

People are most likely busy and haven't had time to digest this and I
may be expecting more excitement from it (I'm excited due to the
results and probably also that such a large change still works). I'll
keep working towards a PR, this was mostly proof of concept, now that
there is some data I'll clean up the code.

I was thinking that a config option to choose the scheduler would be a
good idea. In terms of the project what is the better approach: create
a new template and each place the template class is instantiated
select the queue, or perform the queue selection in the same template
class, or something else I haven't thought of.

Are there public teuthology-openstack systems that could be used for
testing? I don't remember, I'll have to search back through the
mailing list archives.

I appreciate all the direction as I've tried to figure this out.

Thanks,
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Nov 4, 2015 at 8:20 PM, Gregory Farnum  wrote:
> On Wed, Nov 4, 2015 at 7:00 PM, Robert LeBlanc  wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> Thanks for your help on IRC Samuel. I think I found where I made a
>> mistake. I'll do some more testing. So far with max_backfills=1 on
>> spindles, the impact of setting an OSD out and in on a saturated
>> cluster seems to be minimal. On my I/O graphs it is hard to tell where
>> the OSD was out and in recovering. If I/O becomes blocked, it seems
>> that they don't linger around long. All of the clients report getting
>> about the same amount of work done with little variance so no one
>> client is getting indefinitely blocked (or blocked for really long
>> times) causing the results between clients to be skewed like before.
>>
>> So far this queue seems to be very positive. I'd hate to put a lot of
>> working getting this ready to merge if there is little interest in it
>> (a lot of things to do at work and some other things I'd like to track
>> down in the Ceph code as well). What are some of the next steps for
>> something like this, meaning a pretty significant change to core code?
>
> Well, step one is to convince people it's worthwhile. Your performance
> information and anecdotal evidence of client impact is a pretty good
> start. For it to get merged:
> 1) People will need to review it and verify it's not breaking anything
> they can identify from code. Things are a bit constricted right now,
> but this is pretty small and of high interest so I make no promises
> for the core team but submitting a PR will be the way to start.
> Getting positive buy-in from other contributors who are interested in
> performance will also push it up the queue.
> 2) There will need to be a lot of testing on something like this.
> Everything has to pass a run of the RADOS suite. Unfortunately this is
> a bad month for that as the lab is getting physically shipped around
> in a few weeks, so if you can afford to make it happen with the
> teuthology-openstack stuff that will accelerate the timeline a lot (we
> will still need to run it ourselves but once it's passed externally we
> can put it in a lot more test runs we expect to pass, instead of in a
> bucket with others that will all get blocked on any one failure).
> 3) For a new queuing system I suspect that rather than a direct merge
> to default master, Sam will want to keep both in the code for a while
> with a config value and run a lot of the nightlies on this one to
> tease out any subtle races and bugs.
> 4) Eventually we become confident that it's in good shape and it
> replaces the old queue.
>
> Obviously those are the optimistic steps. ;)
> -Greg
>
>>
>> Thank you to all who took time to help point me in the right direction.
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Wed, Nov 4, 2015 at 12:49 PM, Samuel Just  wrote:
>>> I didn't look into it closely, but that almost certainly means that
>>> your queue is reordering primary->replica replicated write messages.
>>> -Sam
>>>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.3
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWOsY1CRDmVDuy+mK58QAA9LIQALIUgbS4BuDS704HPOpA
>> XwvGxspelMCaBkLHLgiHU4T/Jc8JaXhgdRMwMiKeLI246Z7hRngSGlIDYc4+
>> nP4kWZIkwbJeTa/Z6bM6C3itFtJmQpkPvdjI+GiME5ZdYvFgCZQyDD71rqja
>> H14m0+JsEaIHQF0JZz6OyNxbyRWsM+M68nOvpAx8/fOGHBC/0VwPbLrOUP9O
>> 3J3NvbhN9xlYJeivXSAyzxmHQDD8mO1c1AUTrHgnTViD2k3fmcH0mOHIJ+jn
>> ARZbeLN3hlXG0i9PHpnHzBVNSxsfb5VPxX970R3gvRWIt40QV/QL7q2SajWP
>> ofxgEpkaO48ANQSYDlqSNcM+w46TtgcJljtX0vbrHIW3Skyaz4UZQ/dzX4lX
>> a5Zzk01oFwXfMd10KgVbJf78qVYHy2r5aq46iFnrFLU43iy+Qve7Kex4XZFi
>> vPFFVea89Of838NqTxW21+3oJthrz1g7RKHghZAbXaj3WKchuEU+uVG4XTo1
>> 0PU4a5ZYVTH6zYHpwJo2/89OzdkBe9S6s00+4JmfVWWEhb6+QwUjBQp1TJbB
>> TnMzSKfzgRyi/wHThv2XcZN12tttZMM2L4Ea3mHG+cxOTTZ1opv8/H2mprm8
>> 7UuO4vk5K0c4IwPVmt9m5DTVhyn4hZ/QJmc+NARD3zc1u3qWFLkH2WaRMpBb
>> mRWA
>> =kgAl
>> -----END PGP SIGNATURE-----
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWO3JgCRDmVDuy+mK58QAAeCcP/jHnG3r257cdcRYZzg9o
iMOxnuKAXNnwscYzJysCHsoQ2S3dB9SCxt8r+QvDo09IkXzarFaW647nzG6H
zeCtbhx2NFU/jOqPip/8XDUaDYlDrjHuskDJwz+jzoaZfWPjLfPkmETU/8vh
mrGZH+kjYuu1WhmM8cGJZJLrKA7C2OPTAU5PRmx5enClHXhdxyskZ7BUxcXp
uPJJg7pemT/qaJPrO7e7wwhYw43GaeSULp8QGFsqireCwbv9mndB7bbOa40U
ElHmgWgcG1UkkydW/U9DaJHM52ZbrAuG7XkZRsmB1oTmVriEoOFYSiGv5F+R
Mjxe9OlqiL9Fd/AQXunAAMdwIU5T3mlkrxMvhroRkW2+EerrRVW3JbJ8gmQ9
lXPRw9RxcQY5m8S+8+CWikBHvsRBCXEGA8tXUYqLuDJKpRHeCo7PpONS3III
QB+tgWaMteoeJGZ7nGLFcaKxTGa1tNKju4M2845/L8Fawy8jdYYcLqOTUs80
M1gpQ0UHzTXdQEdQnufxgaCFfwblF5vIlr6qd89rR5m0eJipElQLi2Uh0Zd3
0t0i0xtFdprkxDmzX/bzbARAnlS1cz/yoB85r3JxeNPev671mocQc0uyFkt7
P04ogGWzLBN5B4nWNWDznOZS52G+vhkFxryUyl9+LDafAKiTTPmhB/LXPMs+
7ny7
=Xg0t
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-05 15:14       ` Robert LeBlanc
@ 2015-11-05 15:16         ` Mark Nelson
  2015-11-05 15:46         ` Gregory Farnum
  2015-11-06 10:12         ` Sage Weil
  2 siblings, 0 replies; 24+ messages in thread
From: Mark Nelson @ 2015-11-05 15:16 UTC (permalink / raw)
  To: Robert LeBlanc, ceph-devel

Hi Robert,

It definitely is exciting I think. Keep up the good work! :)

Mark

On 11/05/2015 09:14 AM, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Thanks Gregory,
>
> People are most likely busy and haven't had time to digest this and I
> may be expecting more excitement from it (I'm excited due to the
> results and probably also that such a large change still works). I'll
> keep working towards a PR, this was mostly proof of concept, now that
> there is some data I'll clean up the code.
>
> I was thinking that a config option to choose the scheduler would be a
> good idea. In terms of the project what is the better approach: create
> a new template and each place the template class is instantiated
> select the queue, or perform the queue selection in the same template
> class, or something else I haven't thought of.
>
> Are there public teuthology-openstack systems that could be used for
> testing? I don't remember, I'll have to search back through the
> mailing list archives.
>
> I appreciate all the direction as I've tried to figure this out.
>
> Thanks,
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Nov 4, 2015 at 8:20 PM, Gregory Farnum  wrote:
>> On Wed, Nov 4, 2015 at 7:00 PM, Robert LeBlanc  wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> Thanks for your help on IRC Samuel. I think I found where I made a
>>> mistake. I'll do some more testing. So far with max_backfills=1 on
>>> spindles, the impact of setting an OSD out and in on a saturated
>>> cluster seems to be minimal. On my I/O graphs it is hard to tell where
>>> the OSD was out and in recovering. If I/O becomes blocked, it seems
>>> that they don't linger around long. All of the clients report getting
>>> about the same amount of work done with little variance so no one
>>> client is getting indefinitely blocked (or blocked for really long
>>> times) causing the results between clients to be skewed like before.
>>>
>>> So far this queue seems to be very positive. I'd hate to put a lot of
>>> working getting this ready to merge if there is little interest in it
>>> (a lot of things to do at work and some other things I'd like to track
>>> down in the Ceph code as well). What are some of the next steps for
>>> something like this, meaning a pretty significant change to core code?
>>
>> Well, step one is to convince people it's worthwhile. Your performance
>> information and anecdotal evidence of client impact is a pretty good
>> start. For it to get merged:
>> 1) People will need to review it and verify it's not breaking anything
>> they can identify from code. Things are a bit constricted right now,
>> but this is pretty small and of high interest so I make no promises
>> for the core team but submitting a PR will be the way to start.
>> Getting positive buy-in from other contributors who are interested in
>> performance will also push it up the queue.
>> 2) There will need to be a lot of testing on something like this.
>> Everything has to pass a run of the RADOS suite. Unfortunately this is
>> a bad month for that as the lab is getting physically shipped around
>> in a few weeks, so if you can afford to make it happen with the
>> teuthology-openstack stuff that will accelerate the timeline a lot (we
>> will still need to run it ourselves but once it's passed externally we
>> can put it in a lot more test runs we expect to pass, instead of in a
>> bucket with others that will all get blocked on any one failure).
>> 3) For a new queuing system I suspect that rather than a direct merge
>> to default master, Sam will want to keep both in the code for a while
>> with a config value and run a lot of the nightlies on this one to
>> tease out any subtle races and bugs.
>> 4) Eventually we become confident that it's in good shape and it
>> replaces the old queue.
>>
>> Obviously those are the optimistic steps. ;)
>> -Greg
>>
>>>
>>> Thank you to all who took time to help point me in the right direction.
>>> - ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Wed, Nov 4, 2015 at 12:49 PM, Samuel Just  wrote:
>>>> I didn't look into it closely, but that almost certainly means that
>>>> your queue is reordering primary->replica replicated write messages.
>>>> -Sam
>>>>
>>>
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: Mailvelope v1.2.3
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWOsY1CRDmVDuy+mK58QAA9LIQALIUgbS4BuDS704HPOpA
>>> XwvGxspelMCaBkLHLgiHU4T/Jc8JaXhgdRMwMiKeLI246Z7hRngSGlIDYc4+
>>> nP4kWZIkwbJeTa/Z6bM6C3itFtJmQpkPvdjI+GiME5ZdYvFgCZQyDD71rqja
>>> H14m0+JsEaIHQF0JZz6OyNxbyRWsM+M68nOvpAx8/fOGHBC/0VwPbLrOUP9O
>>> 3J3NvbhN9xlYJeivXSAyzxmHQDD8mO1c1AUTrHgnTViD2k3fmcH0mOHIJ+jn
>>> ARZbeLN3hlXG0i9PHpnHzBVNSxsfb5VPxX970R3gvRWIt40QV/QL7q2SajWP
>>> ofxgEpkaO48ANQSYDlqSNcM+w46TtgcJljtX0vbrHIW3Skyaz4UZQ/dzX4lX
>>> a5Zzk01oFwXfMd10KgVbJf78qVYHy2r5aq46iFnrFLU43iy+Qve7Kex4XZFi
>>> vPFFVea89Of838NqTxW21+3oJthrz1g7RKHghZAbXaj3WKchuEU+uVG4XTo1
>>> 0PU4a5ZYVTH6zYHpwJo2/89OzdkBe9S6s00+4JmfVWWEhb6+QwUjBQp1TJbB
>>> TnMzSKfzgRyi/wHThv2XcZN12tttZMM2L4Ea3mHG+cxOTTZ1opv8/H2mprm8
>>> 7UuO4vk5K0c4IwPVmt9m5DTVhyn4hZ/QJmc+NARD3zc1u3qWFLkH2WaRMpBb
>>> mRWA
>>> =kgAl
>>> -----END PGP SIGNATURE-----
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWO3JgCRDmVDuy+mK58QAAeCcP/jHnG3r257cdcRYZzg9o
> iMOxnuKAXNnwscYzJysCHsoQ2S3dB9SCxt8r+QvDo09IkXzarFaW647nzG6H
> zeCtbhx2NFU/jOqPip/8XDUaDYlDrjHuskDJwz+jzoaZfWPjLfPkmETU/8vh
> mrGZH+kjYuu1WhmM8cGJZJLrKA7C2OPTAU5PRmx5enClHXhdxyskZ7BUxcXp
> uPJJg7pemT/qaJPrO7e7wwhYw43GaeSULp8QGFsqireCwbv9mndB7bbOa40U
> ElHmgWgcG1UkkydW/U9DaJHM52ZbrAuG7XkZRsmB1oTmVriEoOFYSiGv5F+R
> Mjxe9OlqiL9Fd/AQXunAAMdwIU5T3mlkrxMvhroRkW2+EerrRVW3JbJ8gmQ9
> lXPRw9RxcQY5m8S+8+CWikBHvsRBCXEGA8tXUYqLuDJKpRHeCo7PpONS3III
> QB+tgWaMteoeJGZ7nGLFcaKxTGa1tNKju4M2845/L8Fawy8jdYYcLqOTUs80
> M1gpQ0UHzTXdQEdQnufxgaCFfwblF5vIlr6qd89rR5m0eJipElQLi2Uh0Zd3
> 0t0i0xtFdprkxDmzX/bzbARAnlS1cz/yoB85r3JxeNPev671mocQc0uyFkt7
> P04ogGWzLBN5B4nWNWDznOZS52G+vhkFxryUyl9+LDafAKiTTPmhB/LXPMs+
> 7ny7
> =Xg0t
> -----END PGP SIGNATURE-----
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-05 15:14       ` Robert LeBlanc
  2015-11-05 15:16         ` Mark Nelson
@ 2015-11-05 15:46         ` Gregory Farnum
  2015-11-06 10:12         ` Sage Weil
  2 siblings, 0 replies; 24+ messages in thread
From: Gregory Farnum @ 2015-11-05 15:46 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-devel

On Thu, Nov 5, 2015 at 7:14 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Thanks Gregory,
>
> People are most likely busy and haven't had time to digest this and I
> may be expecting more excitement from it (I'm excited due to the
> results and probably also that such a large change still works). I'll
> keep working towards a PR, this was mostly proof of concept, now that
> there is some data I'll clean up the code.
>
> I was thinking that a config option to choose the scheduler would be a
> good idea. In terms of the project what is the better approach: create
> a new template and each place the template class is instantiated
> select the queue, or perform the queue selection in the same template
> class, or something else I haven't thought of.
>
> Are there public teuthology-openstack systems that could be used for
> testing? I don't remember, I'll have to search back through the
> mailing list archives.

Loic has sent some emails about this. It's a way to invoke the ceph
tests on an openstack cluster of your choice. I think he said a
reasonable rados run costs ~$20 on ovh? (You'll want to make sure to
use the subset functionality so it doesn't do the full geometric
expansion of the possible test combinations.)
-Greg

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-05 15:14       ` Robert LeBlanc
  2015-11-05 15:16         ` Mark Nelson
  2015-11-05 15:46         ` Gregory Farnum
@ 2015-11-06 10:12         ` Sage Weil
  2015-11-06 17:03           ` Robert LeBlanc
  2 siblings, 1 reply; 24+ messages in thread
From: Sage Weil @ 2015-11-06 10:12 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-devel

On Thu, 5 Nov 2015, Robert LeBlanc wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> Thanks Gregory,
> 
> People are most likely busy and haven't had time to digest this and I
> may be expecting more excitement from it (I'm excited due to the
> results and probably also that such a large change still works). I'll
> keep working towards a PR, this was mostly proof of concept, now that
> there is some data I'll clean up the code.

I'm *very* excited about this.  This is something that almost every 
operator has problems with so it's very encouraging to see that switching 
up the queue has a big impact in your environment.

I'm just following up on this after a week of travel, so apologies if this 
is covered already, but did you compare this implementation to the 
original one with the same tunables?  I see somewhere that you had 
max_backfills=20 at some point, which is going to be bad regardless of the 
queue.

I also see that you chnaged the strict priority threshold from LOW to HIGH 
in OSD.cc; I'm curious how much of an impact was from this vs the queue 
implementation.
 
> I was thinking that a config option to choose the scheduler would be a
> good idea. In terms of the project what is the better approach: create
> a new template and each place the template class is instantiated
> select the queue, or perform the queue selection in the same template
> class, or something else I haven't thought of.

A config option would be nice, but I'd start by just cleaning up the code 
and putting it in a new class (WeightedRoundRobinPriorityQueue or 
whatever).  If we find that it's behaving better I'm not sure how much 
value we get from a tunable.  Note that there is one other user 
(msgr/simple/DispatchQueue) that we might also was to switch over at some 
point.. especially if this implementation is faster.

Once it's cleaned up (remove commented out code, new class) put it up as a 
PR and we can review and get it through testing.

Thanks, Robert!
sage


> 
> Are there public teuthology-openstack systems that could be used for
> testing? I don't remember, I'll have to search back through the
> mailing list archives.
> 
> I appreciate all the direction as I've tried to figure this out.
> 
> Thanks,
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Wed, Nov 4, 2015 at 8:20 PM, Gregory Farnum  wrote:
> > On Wed, Nov 4, 2015 at 7:00 PM, Robert LeBlanc  wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >> Thanks for your help on IRC Samuel. I think I found where I made a
> >> mistake. I'll do some more testing. So far with max_backfills=1 on
> >> spindles, the impact of setting an OSD out and in on a saturated
> >> cluster seems to be minimal. On my I/O graphs it is hard to tell where
> >> the OSD was out and in recovering. If I/O becomes blocked, it seems
> >> that they don't linger around long. All of the clients report getting
> >> about the same amount of work done with little variance so no one
> >> client is getting indefinitely blocked (or blocked for really long
> >> times) causing the results between clients to be skewed like before.
> >>
> >> So far this queue seems to be very positive. I'd hate to put a lot of
> >> working getting this ready to merge if there is little interest in it
> >> (a lot of things to do at work and some other things I'd like to track
> >> down in the Ceph code as well). What are some of the next steps for
> >> something like this, meaning a pretty significant change to core code?
> >
> > Well, step one is to convince people it's worthwhile. Your performance
> > information and anecdotal evidence of client impact is a pretty good
> > start. For it to get merged:
> > 1) People will need to review it and verify it's not breaking anything
> > they can identify from code. Things are a bit constricted right now,
> > but this is pretty small and of high interest so I make no promises
> > for the core team but submitting a PR will be the way to start.
> > Getting positive buy-in from other contributors who are interested in
> > performance will also push it up the queue.
> > 2) There will need to be a lot of testing on something like this.
> > Everything has to pass a run of the RADOS suite. Unfortunately this is
> > a bad month for that as the lab is getting physically shipped around
> > in a few weeks, so if you can afford to make it happen with the
> > teuthology-openstack stuff that will accelerate the timeline a lot (we
> > will still need to run it ourselves but once it's passed externally we
> > can put it in a lot more test runs we expect to pass, instead of in a
> > bucket with others that will all get blocked on any one failure).
> > 3) For a new queuing system I suspect that rather than a direct merge
> > to default master, Sam will want to keep both in the code for a while
> > with a config value and run a lot of the nightlies on this one to
> > tease out any subtle races and bugs.
> > 4) Eventually we become confident that it's in good shape and it
> > replaces the old queue.
> >
> > Obviously those are the optimistic steps. ;)
> > -Greg
> >
> >>
> >> Thank you to all who took time to help point me in the right direction.
> >> - ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Wed, Nov 4, 2015 at 12:49 PM, Samuel Just  wrote:
> >>> I didn't look into it closely, but that almost certainly means that
> >>> your queue is reordering primary->replica replicated write messages.
> >>> -Sam
> >>>
> >>
> >> -----BEGIN PGP SIGNATURE-----
> >> Version: Mailvelope v1.2.3
> >> Comment: https://www.mailvelope.com
> >>
> >> wsFcBAEBCAAQBQJWOsY1CRDmVDuy+mK58QAA9LIQALIUgbS4BuDS704HPOpA
> >> XwvGxspelMCaBkLHLgiHU4T/Jc8JaXhgdRMwMiKeLI246Z7hRngSGlIDYc4+
> >> nP4kWZIkwbJeTa/Z6bM6C3itFtJmQpkPvdjI+GiME5ZdYvFgCZQyDD71rqja
> >> H14m0+JsEaIHQF0JZz6OyNxbyRWsM+M68nOvpAx8/fOGHBC/0VwPbLrOUP9O
> >> 3J3NvbhN9xlYJeivXSAyzxmHQDD8mO1c1AUTrHgnTViD2k3fmcH0mOHIJ+jn
> >> ARZbeLN3hlXG0i9PHpnHzBVNSxsfb5VPxX970R3gvRWIt40QV/QL7q2SajWP
> >> ofxgEpkaO48ANQSYDlqSNcM+w46TtgcJljtX0vbrHIW3Skyaz4UZQ/dzX4lX
> >> a5Zzk01oFwXfMd10KgVbJf78qVYHy2r5aq46iFnrFLU43iy+Qve7Kex4XZFi
> >> vPFFVea89Of838NqTxW21+3oJthrz1g7RKHghZAbXaj3WKchuEU+uVG4XTo1
> >> 0PU4a5ZYVTH6zYHpwJo2/89OzdkBe9S6s00+4JmfVWWEhb6+QwUjBQp1TJbB
> >> TnMzSKfzgRyi/wHThv2XcZN12tttZMM2L4Ea3mHG+cxOTTZ1opv8/H2mprm8
> >> 7UuO4vk5K0c4IwPVmt9m5DTVhyn4hZ/QJmc+NARD3zc1u3qWFLkH2WaRMpBb
> >> mRWA
> >> =kgAl
> >> -----END PGP SIGNATURE-----
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWO3JgCRDmVDuy+mK58QAAeCcP/jHnG3r257cdcRYZzg9o
> iMOxnuKAXNnwscYzJysCHsoQ2S3dB9SCxt8r+QvDo09IkXzarFaW647nzG6H
> zeCtbhx2NFU/jOqPip/8XDUaDYlDrjHuskDJwz+jzoaZfWPjLfPkmETU/8vh
> mrGZH+kjYuu1WhmM8cGJZJLrKA7C2OPTAU5PRmx5enClHXhdxyskZ7BUxcXp
> uPJJg7pemT/qaJPrO7e7wwhYw43GaeSULp8QGFsqireCwbv9mndB7bbOa40U
> ElHmgWgcG1UkkydW/U9DaJHM52ZbrAuG7XkZRsmB1oTmVriEoOFYSiGv5F+R
> Mjxe9OlqiL9Fd/AQXunAAMdwIU5T3mlkrxMvhroRkW2+EerrRVW3JbJ8gmQ9
> lXPRw9RxcQY5m8S+8+CWikBHvsRBCXEGA8tXUYqLuDJKpRHeCo7PpONS3III
> QB+tgWaMteoeJGZ7nGLFcaKxTGa1tNKju4M2845/L8Fawy8jdYYcLqOTUs80
> M1gpQ0UHzTXdQEdQnufxgaCFfwblF5vIlr6qd89rR5m0eJipElQLi2Uh0Zd3
> 0t0i0xtFdprkxDmzX/bzbARAnlS1cz/yoB85r3JxeNPev671mocQc0uyFkt7
> P04ogGWzLBN5B4nWNWDznOZS52G+vhkFxryUyl9+LDafAKiTTPmhB/LXPMs+
> 7ny7
> =Xg0t
> -----END PGP SIGNATURE-----
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-06 10:12         ` Sage Weil
@ 2015-11-06 17:03           ` Robert LeBlanc
  2015-11-06 17:16             ` Milosz Tanski
  2015-11-07  1:39             ` Robert LeBlanc
  0 siblings, 2 replies; 24+ messages in thread
From: Robert LeBlanc @ 2015-11-06 17:03 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> Thanks Gregory,
>>
>> People are most likely busy and haven't had time to digest this and I
>> may be expecting more excitement from it (I'm excited due to the
>> results and probably also that such a large change still works). I'll
>> keep working towards a PR, this was mostly proof of concept, now that
>> there is some data I'll clean up the code.
>
> I'm *very* excited about this.  This is something that almost every
> operator has problems with so it's very encouraging to see that switching
> up the queue has a big impact in your environment.
>
> I'm just following up on this after a week of travel, so apologies if this
> is covered already, but did you compare this implementation to the
> original one with the same tunables?  I see somewhere that you had
> max_backfills=20 at some point, which is going to be bad regardless of the
> queue.
>
> I also see that you chnaged the strict priority threshold from LOW to HIGH
> in OSD.cc; I'm curious how much of an impact was from this vs the queue
> implementation.

Yes max_backfills=20 is problematic for both queues and from what I
can tell is because the OPs are waiting for PGs to get healthy. In a
busy cluster it can take a while due to the recovery ops having low
priority. In the current queue, it is possible to be blocked for a
long time. The new queue seems to prevent that, but they do still back
up. After this, I think I'd like to look into promoting recovery OPs
that are blocking client OPs to higher priorities so that client I/O
doesn't suffer as much during recovery. I think that will be a very
different problem to tackle because I don't think I can do the proper
introspection at the queue level. I'll have to do that logic in OSD.cc
or PG.cc.

The strict priority threshold didn't make much of a difference with
the original queue. I initially eliminated it all together in the WRR,
but there were times that peering would never complete. I want to get
as many OPs in the WRR queue to provide fairness as much as possible.
I haven't tweaked the setting much in the WRR queue yet.

>
>> I was thinking that a config option to choose the scheduler would be a
>> good idea. In terms of the project what is the better approach: create
>> a new template and each place the template class is instantiated
>> select the queue, or perform the queue selection in the same template
>> class, or something else I haven't thought of.
>
> A config option would be nice, but I'd start by just cleaning up the code
> and putting it in a new class (WeightedRoundRobinPriorityQueue or
> whatever).  If we find that it's behaving better I'm not sure how much
> value we get from a tunable.  Note that there is one other user
> (msgr/simple/DispatchQueue) that we might also was to switch over at some
> point.. especially if this implementation is faster.
>
> Once it's cleaned up (remove commented out code, new class) put it up as a
> PR and we can review and get it through testing.

In talking with Samuel in IRC, we think creating an abstract class for
the queue is the best option. C++11 allows you to still optimize
abstract template classes if you use final in the derived class (I
verified the assembly). I'm planning to refactor the code so that
similar code can be reused between queues and allows more flexibility
in the future (components can chose the queue that works the best for
them, etc). The test for which queue to use should be a very simple
comparison and it would allow us to let it bake much longer. I hope to
have a PR mid next week.

- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWPN1xCRDmVDuy+mK58QAA2XwP/1bv4DUVTfoAGU8q6RDK
xXCcqNoy2rFcG/D4wipnnGrjMYnVlH33l73hyaZiSQzMwvfzBAl5igQbIlAh
41yqXOaGxk+BYRXRNHL5KCP0p0esjV8Wv1z9X2yfKdWeHbwueOKju5ljDQ6X
AaVXefw1fdag8JEvSjh0dsjgh8wf3G+lAcC9GHB/PFNHXYsl1BVOUz1REnno
v5vIAZz+iySb8vVrWXJUBaPdW9aao/sqJFU2ZHBziWgeIZ9OlrTlhr9znsxy
aDa18suMC8vhcrZjyAgKlSbxhgynWh7R2RjxFA5ZObBEsdbztJfg9ibyDzKG
Ngpe+jVXGTM03z4ohajzPPJ0tzj03XpGc45yXzj6Q4NHOlp5CPdzAPgmxQkz
ot5cAIR83z67PBIkemeiBQvbC4/ToVCXIBCfEPVW5Yu6grnTd4+AAKxTakip
+tXSai03MNMlNBeaBnooZ/li7s9VMSluXheZ2JNs9ssRTZkGQH3Pof3p3Y5t
pAb7qeRlxm+t+i1rZ1tn1FtF/YAx4DKGvyFz4Pzk8pe77jZ+nQLMtoOJJgGJ
w/+TGiegnUPt6pqWf/Z5o6+GB8SiM/5zKr+Xkm8aIcju/Fq0qy3fx96z81Cv
QC25ZklTblVt1ImSG30qoVcZdqWKTMwnJhpFNj8GVbzyV5EoFh4T0YBmu3fm
FKe/
=yodk
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-06 17:03           ` Robert LeBlanc
@ 2015-11-06 17:16             ` Milosz Tanski
  2015-11-07  1:39             ` Robert LeBlanc
  1 sibling, 0 replies; 24+ messages in thread
From: Milosz Tanski @ 2015-11-06 17:16 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: Sage Weil, ceph-devel

On Fri, Nov 6, 2015 at 12:03 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
> > On Thu, 5 Nov 2015, Robert LeBlanc wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >> Thanks Gregory,
> >>
> >> People are most likely busy and haven't had time to digest this and I
> >> may be expecting more excitement from it (I'm excited due to the
> >> results and probably also that such a large change still works). I'll
> >> keep working towards a PR, this was mostly proof of concept, now that
> >> there is some data I'll clean up the code.
> >
> > I'm *very* excited about this.  This is something that almost every
> > operator has problems with so it's very encouraging to see that switching
> > up the queue has a big impact in your environment.
> >
> > I'm just following up on this after a week of travel, so apologies if this
> > is covered already, but did you compare this implementation to the
> > original one with the same tunables?  I see somewhere that you had
> > max_backfills=20 at some point, which is going to be bad regardless of the
> > queue.
> >
> > I also see that you chnaged the strict priority threshold from LOW to HIGH
> > in OSD.cc; I'm curious how much of an impact was from this vs the queue
> > implementation.
>
> Yes max_backfills=20 is problematic for both queues and from what I
> can tell is because the OPs are waiting for PGs to get healthy. In a
> busy cluster it can take a while due to the recovery ops having low
> priority. In the current queue, it is possible to be blocked for a
> long time. The new queue seems to prevent that, but they do still back
> up. After this, I think I'd like to look into promoting recovery OPs
> that are blocking client OPs to higher priorities so that client I/O
> doesn't suffer as much during recovery. I think that will be a very
> different problem to tackle because I don't think I can do the proper
> introspection at the queue level. I'll have to do that logic in OSD.cc
> or PG.cc.
>
> The strict priority threshold didn't make much of a difference with
> the original queue. I initially eliminated it all together in the WRR,
> but there were times that peering would never complete. I want to get
> as many OPs in the WRR queue to provide fairness as much as possible.
> I haven't tweaked the setting much in the WRR queue yet.
>
> >
> >> I was thinking that a config option to choose the scheduler would be a
> >> good idea. In terms of the project what is the better approach: create
> >> a new template and each place the template class is instantiated
> >> select the queue, or perform the queue selection in the same template
> >> class, or something else I haven't thought of.
> >
> > A config option would be nice, but I'd start by just cleaning up the code
> > and putting it in a new class (WeightedRoundRobinPriorityQueue or
> > whatever).  If we find that it's behaving better I'm not sure how much
> > value we get from a tunable.  Note that there is one other user
> > (msgr/simple/DispatchQueue) that we might also was to switch over at some
> > point.. especially if this implementation is faster.
> >
> > Once it's cleaned up (remove commented out code, new class) put it up as a
> > PR and we can review and get it through testing.
>
> In talking with Samuel in IRC, we think creating an abstract class for
> the queue is the best option. C++11 allows you to still optimize
> abstract template classes if you use final in the derived class (I
> verified the assembly). I'm planning to refactor the code so that
> similar code can be reused between queues and allows more flexibility
> in the future (components can chose the queue that works the best for
> them, etc). The test for which queue to use should be a very simple
> comparison and it would allow us to let it bake much longer. I hope to
> have a PR mid next week.
>

If you build with LTO the compiler/link will preform de-virtualization
in many cases even when you forget to include the final class modifier
keyword.

>
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWPN1xCRDmVDuy+mK58QAA2XwP/1bv4DUVTfoAGU8q6RDK
> xXCcqNoy2rFcG/D4wipnnGrjMYnVlH33l73hyaZiSQzMwvfzBAl5igQbIlAh
> 41yqXOaGxk+BYRXRNHL5KCP0p0esjV8Wv1z9X2yfKdWeHbwueOKju5ljDQ6X
> AaVXefw1fdag8JEvSjh0dsjgh8wf3G+lAcC9GHB/PFNHXYsl1BVOUz1REnno
> v5vIAZz+iySb8vVrWXJUBaPdW9aao/sqJFU2ZHBziWgeIZ9OlrTlhr9znsxy
> aDa18suMC8vhcrZjyAgKlSbxhgynWh7R2RjxFA5ZObBEsdbztJfg9ibyDzKG
> Ngpe+jVXGTM03z4ohajzPPJ0tzj03XpGc45yXzj6Q4NHOlp5CPdzAPgmxQkz
> ot5cAIR83z67PBIkemeiBQvbC4/ToVCXIBCfEPVW5Yu6grnTd4+AAKxTakip
> +tXSai03MNMlNBeaBnooZ/li7s9VMSluXheZ2JNs9ssRTZkGQH3Pof3p3Y5t
> pAb7qeRlxm+t+i1rZ1tn1FtF/YAx4DKGvyFz4Pzk8pe77jZ+nQLMtoOJJgGJ
> w/+TGiegnUPt6pqWf/Z5o6+GB8SiM/5zKr+Xkm8aIcju/Fq0qy3fx96z81Cv
> QC25ZklTblVt1ImSG30qoVcZdqWKTMwnJhpFNj8GVbzyV5EoFh4T0YBmu3fm
> FKe/
> =yodk
> -----END PGP SIGNATURE-----
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-06 17:03           ` Robert LeBlanc
  2015-11-06 17:16             ` Milosz Tanski
@ 2015-11-07  1:39             ` Robert LeBlanc
  2015-11-08 14:20               ` Sage Weil
  1 sibling, 1 reply; 24+ messages in thread
From: Robert LeBlanc @ 2015-11-07  1:39 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

After trying to look through the recovery code, I'm getting the
feeling that recovery OPs are not scheduled in the OP queue that I've
been working on. Does that sound right? In the OSD logs I'm only
seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
If the recovery is in another separate queue, then there is no
reliable way to prioritize OPs between them.

If I'm going off in to the weeds, please help me get back on the trail.

Thanks,
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
>> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> Thanks Gregory,
>>>
>>> People are most likely busy and haven't had time to digest this and I
>>> may be expecting more excitement from it (I'm excited due to the
>>> results and probably also that such a large change still works). I'll
>>> keep working towards a PR, this was mostly proof of concept, now that
>>> there is some data I'll clean up the code.
>>
>> I'm *very* excited about this.  This is something that almost every
>> operator has problems with so it's very encouraging to see that switching
>> up the queue has a big impact in your environment.
>>
>> I'm just following up on this after a week of travel, so apologies if this
>> is covered already, but did you compare this implementation to the
>> original one with the same tunables?  I see somewhere that you had
>> max_backfills=20 at some point, which is going to be bad regardless of the
>> queue.
>>
>> I also see that you chnaged the strict priority threshold from LOW to HIGH
>> in OSD.cc; I'm curious how much of an impact was from this vs the queue
>> implementation.
>
> Yes max_backfills=20 is problematic for both queues and from what I
> can tell is because the OPs are waiting for PGs to get healthy. In a
> busy cluster it can take a while due to the recovery ops having low
> priority. In the current queue, it is possible to be blocked for a
> long time. The new queue seems to prevent that, but they do still back
> up. After this, I think I'd like to look into promoting recovery OPs
> that are blocking client OPs to higher priorities so that client I/O
> doesn't suffer as much during recovery. I think that will be a very
> different problem to tackle because I don't think I can do the proper
> introspection at the queue level. I'll have to do that logic in OSD.cc
> or PG.cc.
>
> The strict priority threshold didn't make much of a difference with
> the original queue. I initially eliminated it all together in the WRR,
> but there were times that peering would never complete. I want to get
> as many OPs in the WRR queue to provide fairness as much as possible.
> I haven't tweaked the setting much in the WRR queue yet.
>
>>
>>> I was thinking that a config option to choose the scheduler would be a
>>> good idea. In terms of the project what is the better approach: create
>>> a new template and each place the template class is instantiated
>>> select the queue, or perform the queue selection in the same template
>>> class, or something else I haven't thought of.
>>
>> A config option would be nice, but I'd start by just cleaning up the code
>> and putting it in a new class (WeightedRoundRobinPriorityQueue or
>> whatever).  If we find that it's behaving better I'm not sure how much
>> value we get from a tunable.  Note that there is one other user
>> (msgr/simple/DispatchQueue) that we might also was to switch over at some
>> point.. especially if this implementation is faster.
>>
>> Once it's cleaned up (remove commented out code, new class) put it up as a
>> PR and we can review and get it through testing.
>
> In talking with Samuel in IRC, we think creating an abstract class for
> the queue is the best option. C++11 allows you to still optimize
> abstract template classes if you use final in the derived class (I
> verified the assembly). I'm planning to refactor the code so that
> similar code can be reused between queues and allows more flexibility
> in the future (components can chose the queue that works the best for
> them, etc). The test for which queue to use should be a very simple
> comparison and it would allow us to let it bake much longer. I hope to
> have a PR mid next week.
>
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWPN1xCRDmVDuy+mK58QAA2XwP/1bv4DUVTfoAGU8q6RDK
> xXCcqNoy2rFcG/D4wipnnGrjMYnVlH33l73hyaZiSQzMwvfzBAl5igQbIlAh
> 41yqXOaGxk+BYRXRNHL5KCP0p0esjV8Wv1z9X2yfKdWeHbwueOKju5ljDQ6X
> AaVXefw1fdag8JEvSjh0dsjgh8wf3G+lAcC9GHB/PFNHXYsl1BVOUz1REnno
> v5vIAZz+iySb8vVrWXJUBaPdW9aao/sqJFU2ZHBziWgeIZ9OlrTlhr9znsxy
> aDa18suMC8vhcrZjyAgKlSbxhgynWh7R2RjxFA5ZObBEsdbztJfg9ibyDzKG
> Ngpe+jVXGTM03z4ohajzPPJ0tzj03XpGc45yXzj6Q4NHOlp5CPdzAPgmxQkz
> ot5cAIR83z67PBIkemeiBQvbC4/ToVCXIBCfEPVW5Yu6grnTd4+AAKxTakip
> +tXSai03MNMlNBeaBnooZ/li7s9VMSluXheZ2JNs9ssRTZkGQH3Pof3p3Y5t
> pAb7qeRlxm+t+i1rZ1tn1FtF/YAx4DKGvyFz4Pzk8pe77jZ+nQLMtoOJJgGJ
> w/+TGiegnUPt6pqWf/Z5o6+GB8SiM/5zKr+Xkm8aIcju/Fq0qy3fx96z81Cv
> QC25ZklTblVt1ImSG30qoVcZdqWKTMwnJhpFNj8GVbzyV5EoFh4T0YBmu3fm
> FKe/
> =yodk
> -----END PGP SIGNATURE-----

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWPVZPCRDmVDuy+mK58QAAyK4QAL4ZdF0bRxSVSQAZGgDN
pEfGEO1+heaj5Uj1sUitoXct5f//TbXcnuJDStlMe0rbplZDPUU0ZsXs8hNE
sro6GiFuSP6ZQgHshW50d8iCGjmF/DKhYPs6jWJUIwCMelY45YLfpadAmkZT
GePGEu5UzhYhlfQeiaQOFd7jWH2uVOnPLASK6f68cNRUv8rywJ8q5/6h0p8I
TPg277NglGP1VntZ0z4/9CsSl49YOowVQooRZ9JQr3BpFYsbSEBBY5vLak8q
X9Rb0rngG52vKT5VE58wUY/Pfbdwn7nbnV/BOUBnhBr+f14QKhNsWKpVM9EV
R/cjlqJV3vesrwrXWay+4AaVoOn1TPMgBc/YV9LOlSdectNC0Ig7iBqC0Mjo
kgeSQ0NJZSN99o4GKUnfwnd/fjDLzyi03XX5JkUMmEDLKPjT0LTmcnVSP5gu
GGdEDNNEfIyt8PZalB4HN1Ik0c4/YdQKpb6XjbejoN37NvYom+dwZsKk2g/J
Qa1bFDzvUZoTfax1yyMh2xu4b0rI6+a3bBhVBbY6Wz417aPRAhz09DecJoxt
28jqn3Aj7ARETg5BTCn1gGjEWP4IytLKOvctukCFSnxJWKPumTMRqfTUnsKu
FxNjhSk5Kc+kVV7wQ7cU6NzxoBYHXMoEeamFXBmLooUG4lDKEeg0t+R9hPbT
ABCA
=yXJO
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-07  1:39             ` Robert LeBlanc
@ 2015-11-08 14:20               ` Sage Weil
  2015-11-09 16:49                 ` Samuel Just
  0 siblings, 1 reply; 24+ messages in thread
From: Sage Weil @ 2015-11-08 14:20 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: ceph-devel

On Fri, 6 Nov 2015, Robert LeBlanc wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> After trying to look through the recovery code, I'm getting the
> feeling that recovery OPs are not scheduled in the OP queue that I've
> been working on. Does that sound right? In the OSD logs I'm only
> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
> If the recovery is in another separate queue, then there is no
> reliable way to prioritize OPs between them.
> 
> If I'm going off in to the weeds, please help me get back on the trail.

Yeah, the recovery work isn't in the unified queue yet.

sage



> 
> Thanks,
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA256
> >
> > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
> >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA256
> >>>
> >>> Thanks Gregory,
> >>>
> >>> People are most likely busy and haven't had time to digest this and I
> >>> may be expecting more excitement from it (I'm excited due to the
> >>> results and probably also that such a large change still works). I'll
> >>> keep working towards a PR, this was mostly proof of concept, now that
> >>> there is some data I'll clean up the code.
> >>
> >> I'm *very* excited about this.  This is something that almost every
> >> operator has problems with so it's very encouraging to see that switching
> >> up the queue has a big impact in your environment.
> >>
> >> I'm just following up on this after a week of travel, so apologies if this
> >> is covered already, but did you compare this implementation to the
> >> original one with the same tunables?  I see somewhere that you had
> >> max_backfills=20 at some point, which is going to be bad regardless of the
> >> queue.
> >>
> >> I also see that you chnaged the strict priority threshold from LOW to HIGH
> >> in OSD.cc; I'm curious how much of an impact was from this vs the queue
> >> implementation.
> >
> > Yes max_backfills=20 is problematic for both queues and from what I
> > can tell is because the OPs are waiting for PGs to get healthy. In a
> > busy cluster it can take a while due to the recovery ops having low
> > priority. In the current queue, it is possible to be blocked for a
> > long time. The new queue seems to prevent that, but they do still back
> > up. After this, I think I'd like to look into promoting recovery OPs
> > that are blocking client OPs to higher priorities so that client I/O
> > doesn't suffer as much during recovery. I think that will be a very
> > different problem to tackle because I don't think I can do the proper
> > introspection at the queue level. I'll have to do that logic in OSD.cc
> > or PG.cc.
> >
> > The strict priority threshold didn't make much of a difference with
> > the original queue. I initially eliminated it all together in the WRR,
> > but there were times that peering would never complete. I want to get
> > as many OPs in the WRR queue to provide fairness as much as possible.
> > I haven't tweaked the setting much in the WRR queue yet.
> >
> >>
> >>> I was thinking that a config option to choose the scheduler would be a
> >>> good idea. In terms of the project what is the better approach: create
> >>> a new template and each place the template class is instantiated
> >>> select the queue, or perform the queue selection in the same template
> >>> class, or something else I haven't thought of.
> >>
> >> A config option would be nice, but I'd start by just cleaning up the code
> >> and putting it in a new class (WeightedRoundRobinPriorityQueue or
> >> whatever).  If we find that it's behaving better I'm not sure how much
> >> value we get from a tunable.  Note that there is one other user
> >> (msgr/simple/DispatchQueue) that we might also was to switch over at some
> >> point.. especially if this implementation is faster.
> >>
> >> Once it's cleaned up (remove commented out code, new class) put it up as a
> >> PR and we can review and get it through testing.
> >
> > In talking with Samuel in IRC, we think creating an abstract class for
> > the queue is the best option. C++11 allows you to still optimize
> > abstract template classes if you use final in the derived class (I
> > verified the assembly). I'm planning to refactor the code so that
> > similar code can be reused between queues and allows more flexibility
> > in the future (components can chose the queue that works the best for
> > them, etc). The test for which queue to use should be a very simple
> > comparison and it would allow us to let it bake much longer. I hope to
> > have a PR mid next week.
> >
> > - ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> > -----BEGIN PGP SIGNATURE-----
> > Version: Mailvelope v1.2.3
> > Comment: https://www.mailvelope.com
> >
> > wsFcBAEBCAAQBQJWPN1xCRDmVDuy+mK58QAA2XwP/1bv4DUVTfoAGU8q6RDK
> > xXCcqNoy2rFcG/D4wipnnGrjMYnVlH33l73hyaZiSQzMwvfzBAl5igQbIlAh
> > 41yqXOaGxk+BYRXRNHL5KCP0p0esjV8Wv1z9X2yfKdWeHbwueOKju5ljDQ6X
> > AaVXefw1fdag8JEvSjh0dsjgh8wf3G+lAcC9GHB/PFNHXYsl1BVOUz1REnno
> > v5vIAZz+iySb8vVrWXJUBaPdW9aao/sqJFU2ZHBziWgeIZ9OlrTlhr9znsxy
> > aDa18suMC8vhcrZjyAgKlSbxhgynWh7R2RjxFA5ZObBEsdbztJfg9ibyDzKG
> > Ngpe+jVXGTM03z4ohajzPPJ0tzj03XpGc45yXzj6Q4NHOlp5CPdzAPgmxQkz
> > ot5cAIR83z67PBIkemeiBQvbC4/ToVCXIBCfEPVW5Yu6grnTd4+AAKxTakip
> > +tXSai03MNMlNBeaBnooZ/li7s9VMSluXheZ2JNs9ssRTZkGQH3Pof3p3Y5t
> > pAb7qeRlxm+t+i1rZ1tn1FtF/YAx4DKGvyFz4Pzk8pe77jZ+nQLMtoOJJgGJ
> > w/+TGiegnUPt6pqWf/Z5o6+GB8SiM/5zKr+Xkm8aIcju/Fq0qy3fx96z81Cv
> > QC25ZklTblVt1ImSG30qoVcZdqWKTMwnJhpFNj8GVbzyV5EoFh4T0YBmu3fm
> > FKe/
> > =yodk
> > -----END PGP SIGNATURE-----
> 
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
> 
> wsFcBAEBCAAQBQJWPVZPCRDmVDuy+mK58QAAyK4QAL4ZdF0bRxSVSQAZGgDN
> pEfGEO1+heaj5Uj1sUitoXct5f//TbXcnuJDStlMe0rbplZDPUU0ZsXs8hNE
> sro6GiFuSP6ZQgHshW50d8iCGjmF/DKhYPs6jWJUIwCMelY45YLfpadAmkZT
> GePGEu5UzhYhlfQeiaQOFd7jWH2uVOnPLASK6f68cNRUv8rywJ8q5/6h0p8I
> TPg277NglGP1VntZ0z4/9CsSl49YOowVQooRZ9JQr3BpFYsbSEBBY5vLak8q
> X9Rb0rngG52vKT5VE58wUY/Pfbdwn7nbnV/BOUBnhBr+f14QKhNsWKpVM9EV
> R/cjlqJV3vesrwrXWay+4AaVoOn1TPMgBc/YV9LOlSdectNC0Ig7iBqC0Mjo
> kgeSQ0NJZSN99o4GKUnfwnd/fjDLzyi03XX5JkUMmEDLKPjT0LTmcnVSP5gu
> GGdEDNNEfIyt8PZalB4HN1Ik0c4/YdQKpb6XjbejoN37NvYom+dwZsKk2g/J
> Qa1bFDzvUZoTfax1yyMh2xu4b0rI6+a3bBhVBbY6Wz417aPRAhz09DecJoxt
> 28jqn3Aj7ARETg5BTCn1gGjEWP4IytLKOvctukCFSnxJWKPumTMRqfTUnsKu
> FxNjhSk5Kc+kVV7wQ7cU6NzxoBYHXMoEeamFXBmLooUG4lDKEeg0t+R9hPbT
> ABCA
> =yXJO
> -----END PGP SIGNATURE-----
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-08 14:20               ` Sage Weil
@ 2015-11-09 16:49                 ` Samuel Just
  2015-11-09 17:19                   ` Robert LeBlanc
  0 siblings, 1 reply; 24+ messages in thread
From: Samuel Just @ 2015-11-09 16:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: Robert LeBlanc, ceph-devel

It's partially in the unified queue.  The primary's background work
for kicking off a recovery operation is not in the unified queue, but
the messages to the replicas (pushes, pull, backfill scans) as well as
their replies are in the unified queue as normal messages.  I've got a
branch moving the primary's work to the queue as well (didn't quite
make infernalis) --
https://github.com/athanatos/ceph/tree/wip-recovery-wq.  I'm trying to
stabilize it now for merge that infernalis is out.
-Sam

On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil <sage@newdream.net> wrote:
> On Fri, 6 Nov 2015, Robert LeBlanc wrote:
>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> After trying to look through the recovery code, I'm getting the
>> feeling that recovery OPs are not scheduled in the OP queue that I've
>> been working on. Does that sound right? In the OSD logs I'm only
>> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
>> If the recovery is in another separate queue, then there is no
>> reliable way to prioritize OPs between them.
>>
>> If I'm going off in to the weeds, please help me get back on the trail.
>
> Yeah, the recovery work isn't in the unified queue yet.
>
> sage
>
>
>
>>
>> Thanks,
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
>> > -----BEGIN PGP SIGNED MESSAGE-----
>> > Hash: SHA256
>> >
>> > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
>> >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>> >>> Hash: SHA256
>> >>>
>> >>> Thanks Gregory,
>> >>>
>> >>> People are most likely busy and haven't had time to digest this and I
>> >>> may be expecting more excitement from it (I'm excited due to the
>> >>> results and probably also that such a large change still works). I'll
>> >>> keep working towards a PR, this was mostly proof of concept, now that
>> >>> there is some data I'll clean up the code.
>> >>
>> >> I'm *very* excited about this.  This is something that almost every
>> >> operator has problems with so it's very encouraging to see that switching
>> >> up the queue has a big impact in your environment.
>> >>
>> >> I'm just following up on this after a week of travel, so apologies if this
>> >> is covered already, but did you compare this implementation to the
>> >> original one with the same tunables?  I see somewhere that you had
>> >> max_backfills=20 at some point, which is going to be bad regardless of the
>> >> queue.
>> >>
>> >> I also see that you chnaged the strict priority threshold from LOW to HIGH
>> >> in OSD.cc; I'm curious how much of an impact was from this vs the queue
>> >> implementation.
>> >
>> > Yes max_backfills=20 is problematic for both queues and from what I
>> > can tell is because the OPs are waiting for PGs to get healthy. In a
>> > busy cluster it can take a while due to the recovery ops having low
>> > priority. In the current queue, it is possible to be blocked for a
>> > long time. The new queue seems to prevent that, but they do still back
>> > up. After this, I think I'd like to look into promoting recovery OPs
>> > that are blocking client OPs to higher priorities so that client I/O
>> > doesn't suffer as much during recovery. I think that will be a very
>> > different problem to tackle because I don't think I can do the proper
>> > introspection at the queue level. I'll have to do that logic in OSD.cc
>> > or PG.cc.
>> >
>> > The strict priority threshold didn't make much of a difference with
>> > the original queue. I initially eliminated it all together in the WRR,
>> > but there were times that peering would never complete. I want to get
>> > as many OPs in the WRR queue to provide fairness as much as possible.
>> > I haven't tweaked the setting much in the WRR queue yet.
>> >
>> >>
>> >>> I was thinking that a config option to choose the scheduler would be a
>> >>> good idea. In terms of the project what is the better approach: create
>> >>> a new template and each place the template class is instantiated
>> >>> select the queue, or perform the queue selection in the same template
>> >>> class, or something else I haven't thought of.
>> >>
>> >> A config option would be nice, but I'd start by just cleaning up the code
>> >> and putting it in a new class (WeightedRoundRobinPriorityQueue or
>> >> whatever).  If we find that it's behaving better I'm not sure how much
>> >> value we get from a tunable.  Note that there is one other user
>> >> (msgr/simple/DispatchQueue) that we might also was to switch over at some
>> >> point.. especially if this implementation is faster.
>> >>
>> >> Once it's cleaned up (remove commented out code, new class) put it up as a
>> >> PR and we can review and get it through testing.
>> >
>> > In talking with Samuel in IRC, we think creating an abstract class for
>> > the queue is the best option. C++11 allows you to still optimize
>> > abstract template classes if you use final in the derived class (I
>> > verified the assembly). I'm planning to refactor the code so that
>> > similar code can be reused between queues and allows more flexibility
>> > in the future (components can chose the queue that works the best for
>> > them, etc). The test for which queue to use should be a very simple
>> > comparison and it would allow us to let it bake much longer. I hope to
>> > have a PR mid next week.
>> >
>> > - ----------------
>> > Robert LeBlanc
>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >
>> > -----BEGIN PGP SIGNATURE-----
>> > Version: Mailvelope v1.2.3
>> > Comment: https://www.mailvelope.com
>> >
>> > wsFcBAEBCAAQBQJWPN1xCRDmVDuy+mK58QAA2XwP/1bv4DUVTfoAGU8q6RDK
>> > xXCcqNoy2rFcG/D4wipnnGrjMYnVlH33l73hyaZiSQzMwvfzBAl5igQbIlAh
>> > 41yqXOaGxk+BYRXRNHL5KCP0p0esjV8Wv1z9X2yfKdWeHbwueOKju5ljDQ6X
>> > AaVXefw1fdag8JEvSjh0dsjgh8wf3G+lAcC9GHB/PFNHXYsl1BVOUz1REnno
>> > v5vIAZz+iySb8vVrWXJUBaPdW9aao/sqJFU2ZHBziWgeIZ9OlrTlhr9znsxy
>> > aDa18suMC8vhcrZjyAgKlSbxhgynWh7R2RjxFA5ZObBEsdbztJfg9ibyDzKG
>> > Ngpe+jVXGTM03z4ohajzPPJ0tzj03XpGc45yXzj6Q4NHOlp5CPdzAPgmxQkz
>> > ot5cAIR83z67PBIkemeiBQvbC4/ToVCXIBCfEPVW5Yu6grnTd4+AAKxTakip
>> > +tXSai03MNMlNBeaBnooZ/li7s9VMSluXheZ2JNs9ssRTZkGQH3Pof3p3Y5t
>> > pAb7qeRlxm+t+i1rZ1tn1FtF/YAx4DKGvyFz4Pzk8pe77jZ+nQLMtoOJJgGJ
>> > w/+TGiegnUPt6pqWf/Z5o6+GB8SiM/5zKr+Xkm8aIcju/Fq0qy3fx96z81Cv
>> > QC25ZklTblVt1ImSG30qoVcZdqWKTMwnJhpFNj8GVbzyV5EoFh4T0YBmu3fm
>> > FKe/
>> > =yodk
>> > -----END PGP SIGNATURE-----
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.3
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWPVZPCRDmVDuy+mK58QAAyK4QAL4ZdF0bRxSVSQAZGgDN
>> pEfGEO1+heaj5Uj1sUitoXct5f//TbXcnuJDStlMe0rbplZDPUU0ZsXs8hNE
>> sro6GiFuSP6ZQgHshW50d8iCGjmF/DKhYPs6jWJUIwCMelY45YLfpadAmkZT
>> GePGEu5UzhYhlfQeiaQOFd7jWH2uVOnPLASK6f68cNRUv8rywJ8q5/6h0p8I
>> TPg277NglGP1VntZ0z4/9CsSl49YOowVQooRZ9JQr3BpFYsbSEBBY5vLak8q
>> X9Rb0rngG52vKT5VE58wUY/Pfbdwn7nbnV/BOUBnhBr+f14QKhNsWKpVM9EV
>> R/cjlqJV3vesrwrXWay+4AaVoOn1TPMgBc/YV9LOlSdectNC0Ig7iBqC0Mjo
>> kgeSQ0NJZSN99o4GKUnfwnd/fjDLzyi03XX5JkUMmEDLKPjT0LTmcnVSP5gu
>> GGdEDNNEfIyt8PZalB4HN1Ik0c4/YdQKpb6XjbejoN37NvYom+dwZsKk2g/J
>> Qa1bFDzvUZoTfax1yyMh2xu4b0rI6+a3bBhVBbY6Wz417aPRAhz09DecJoxt
>> 28jqn3Aj7ARETg5BTCn1gGjEWP4IytLKOvctukCFSnxJWKPumTMRqfTUnsKu
>> FxNjhSk5Kc+kVV7wQ7cU6NzxoBYHXMoEeamFXBmLooUG4lDKEeg0t+R9hPbT
>> ABCA
>> =yXJO
>> -----END PGP SIGNATURE-----
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-09 16:49                 ` Samuel Just
@ 2015-11-09 17:19                   ` Robert LeBlanc
  2015-11-09 18:19                     ` Samuel Just
  0 siblings, 1 reply; 24+ messages in thread
From: Robert LeBlanc @ 2015-11-09 17:19 UTC (permalink / raw)
  To: Samuel Just; +Cc: Sage Weil, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I should probably work against this branch.

I've got some more reading of code to do, but I'm thinking that there
isn't one of these queues for each OSD, it seems like there is one
queue for each thread in the OSD. If this is true, I think it makes
sense to break the queue into it's own thread and have each 'worker'
thread push and pop OPs out of that thread. I have been focused on the
Queue code that I haven't really looked at the OSD/PG code until last
Friday and it is like trying to drink from a fire hose going through
that code, so I may be misunderstanding something.

I'd appreciate any pointers to quickly understanding the OSD/PG code
specifically around the OPs and the queue.

Thanks,
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWQNWzCRDmVDuy+mK58QAAAGAQAJ44uFZNl84eGrHIMzDc
EyMBCE/STAOtZINV0DRmnKqrKLeWZ2ajHhr7gYdXByMdCi9QTnz/pYH8fP4m
sTtf8MnaEdDuFYpc+kVP4sOZx+efF64s4isN8lDpoa6noqDR68W3xJ7MV9/l
WJizoD9LWOvPVdPlO6M1jw3waL1eZMrxzPGpz2Xws4XnyGjIWeoUWl0kZYyT
EwGNGaQXBsioowd2PySc3axAY/zaeaJFPp4trw2k2sE9Yi4NT39R3tWgljkC
Ras8TjfHml1+xPeVadB4fdbYl2TaR8xYsVWCp+k1IuiEk/CAeljMjfST/Dqf
TBMhhw8h24AP1GLPwiOFdGIh6h6gj0UoXeXsfHKhSuW6M8Ur+9fuynyuhBUV
V0707nVmu9eiBwkgDHBcIRlnMQ0dDH60Uubf6ShagwjQSg6yfh6MNHVt6FFv
PJCcGDfEqzCjbcGhRyG0bE4aAXXAlHnUy4y2VRGIodmTHqUcZAfXoQd3dklC
KdSNyY+z/inOZip1Pbal4jNv3jAJBABn6Y1nNuB3W+33s/Jvt/aQbJpwYlkQ
iivTMkoMsimVNKAhoTybZpVwJ2Hy5TL/tWqDNwg3TBXtWSFU5S1XgJzoAQm5
yE7dbMwhAObw3XQ/eGMTmyICs1vwD0+mxaNHHWzSubtFKcdblUDW6BUxc+lj
ztfA
=GSDL
-----END PGP SIGNATURE-----
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Nov 9, 2015 at 9:49 AM, Samuel Just <sjust@redhat.com> wrote:
> It's partially in the unified queue.  The primary's background work
> for kicking off a recovery operation is not in the unified queue, but
> the messages to the replicas (pushes, pull, backfill scans) as well as
> their replies are in the unified queue as normal messages.  I've got a
> branch moving the primary's work to the queue as well (didn't quite
> make infernalis) --
> https://github.com/athanatos/ceph/tree/wip-recovery-wq.  I'm trying to
> stabilize it now for merge that infernalis is out.
> -Sam
>
> On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil <sage@newdream.net> wrote:
>> On Fri, 6 Nov 2015, Robert LeBlanc wrote:
>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> After trying to look through the recovery code, I'm getting the
>>> feeling that recovery OPs are not scheduled in the OP queue that I've
>>> been working on. Does that sound right? In the OSD logs I'm only
>>> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
>>> If the recovery is in another separate queue, then there is no
>>> reliable way to prioritize OPs between them.
>>>
>>> If I'm going off in to the weeds, please help me get back on the trail.
>>
>> Yeah, the recovery work isn't in the unified queue yet.
>>
>> sage
>>
>>
>>
>>>
>>> Thanks,
>>> - ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
>>> > -----BEGIN PGP SIGNED MESSAGE-----
>>> > Hash: SHA256
>>> >
>>> > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
>>> >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
>>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>> >>> Hash: SHA256
>>> >>>
>>> >>> Thanks Gregory,
>>> >>>
>>> >>> People are most likely busy and haven't had time to digest this and I
>>> >>> may be expecting more excitement from it (I'm excited due to the
>>> >>> results and probably also that such a large change still works). I'll
>>> >>> keep working towards a PR, this was mostly proof of concept, now that
>>> >>> there is some data I'll clean up the code.
>>> >>
>>> >> I'm *very* excited about this.  This is something that almost every
>>> >> operator has problems with so it's very encouraging to see that switching
>>> >> up the queue has a big impact in your environment.
>>> >>
>>> >> I'm just following up on this after a week of travel, so apologies if this
>>> >> is covered already, but did you compare this implementation to the
>>> >> original one with the same tunables?  I see somewhere that you had
>>> >> max_backfills=20 at some point, which is going to be bad regardless of the
>>> >> queue.
>>> >>
>>> >> I also see that you chnaged the strict priority threshold from LOW to HIGH
>>> >> in OSD.cc; I'm curious how much of an impact was from this vs the queue
>>> >> implementation.
>>> >
>>> > Yes max_backfills=20 is problematic for both queues and from what I
>>> > can tell is because the OPs are waiting for PGs to get healthy. In a
>>> > busy cluster it can take a while due to the recovery ops having low
>>> > priority. In the current queue, it is possible to be blocked for a
>>> > long time. The new queue seems to prevent that, but they do still back
>>> > up. After this, I think I'd like to look into promoting recovery OPs
>>> > that are blocking client OPs to higher priorities so that client I/O
>>> > doesn't suffer as much during recovery. I think that will be a very
>>> > different problem to tackle because I don't think I can do the proper
>>> > introspection at the queue level. I'll have to do that logic in OSD.cc
>>> > or PG.cc.
>>> >
>>> > The strict priority threshold didn't make much of a difference with
>>> > the original queue. I initially eliminated it all together in the WRR,
>>> > but there were times that peering would never complete. I want to get
>>> > as many OPs in the WRR queue to provide fairness as much as possible.
>>> > I haven't tweaked the setting much in the WRR queue yet.
>>> >
>>> >>
>>> >>> I was thinking that a config option to choose the scheduler would be a
>>> >>> good idea. In terms of the project what is the better approach: create
>>> >>> a new template and each place the template class is instantiated
>>> >>> select the queue, or perform the queue selection in the same template
>>> >>> class, or something else I haven't thought of.
>>> >>
>>> >> A config option would be nice, but I'd start by just cleaning up the code
>>> >> and putting it in a new class (WeightedRoundRobinPriorityQueue or
>>> >> whatever).  If we find that it's behaving better I'm not sure how much
>>> >> value we get from a tunable.  Note that there is one other user
>>> >> (msgr/simple/DispatchQueue) that we might also was to switch over at some
>>> >> point.. especially if this implementation is faster.
>>> >>
>>> >> Once it's cleaned up (remove commented out code, new class) put it up as a
>>> >> PR and we can review and get it through testing.
>>> >
>>> > In talking with Samuel in IRC, we think creating an abstract class for
>>> > the queue is the best option. C++11 allows you to still optimize
>>> > abstract template classes if you use final in the derived class (I
>>> > verified the assembly). I'm planning to refactor the code so that
>>> > similar code can be reused between queues and allows more flexibility
>>> > in the future (components can chose the queue that works the best for
>>> > them, etc). The test for which queue to use should be a very simple
>>> > comparison and it would allow us to let it bake much longer. I hope to
>>> > have a PR mid next week.
>>> >
>>> > - ----------------
>>> > Robert LeBlanc
>>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> >
>>> > -----BEGIN PGP SIGNATURE-----
>>> > Version: Mailvelope v1.2.3
>>> > Comment: https://www.mailvelope.com
>>> >
>>> > wsFcBAEBCAAQBQJWPN1xCRDmVDuy+mK58QAA2XwP/1bv4DUVTfoAGU8q6RDK
>>> > xXCcqNoy2rFcG/D4wipnnGrjMYnVlH33l73hyaZiSQzMwvfzBAl5igQbIlAh
>>> > 41yqXOaGxk+BYRXRNHL5KCP0p0esjV8Wv1z9X2yfKdWeHbwueOKju5ljDQ6X
>>> > AaVXefw1fdag8JEvSjh0dsjgh8wf3G+lAcC9GHB/PFNHXYsl1BVOUz1REnno
>>> > v5vIAZz+iySb8vVrWXJUBaPdW9aao/sqJFU2ZHBziWgeIZ9OlrTlhr9znsxy
>>> > aDa18suMC8vhcrZjyAgKlSbxhgynWh7R2RjxFA5ZObBEsdbztJfg9ibyDzKG
>>> > Ngpe+jVXGTM03z4ohajzPPJ0tzj03XpGc45yXzj6Q4NHOlp5CPdzAPgmxQkz
>>> > ot5cAIR83z67PBIkemeiBQvbC4/ToVCXIBCfEPVW5Yu6grnTd4+AAKxTakip
>>> > +tXSai03MNMlNBeaBnooZ/li7s9VMSluXheZ2JNs9ssRTZkGQH3Pof3p3Y5t
>>> > pAb7qeRlxm+t+i1rZ1tn1FtF/YAx4DKGvyFz4Pzk8pe77jZ+nQLMtoOJJgGJ
>>> > w/+TGiegnUPt6pqWf/Z5o6+GB8SiM/5zKr+Xkm8aIcju/Fq0qy3fx96z81Cv
>>> > QC25ZklTblVt1ImSG30qoVcZdqWKTMwnJhpFNj8GVbzyV5EoFh4T0YBmu3fm
>>> > FKe/
>>> > =yodk
>>> > -----END PGP SIGNATURE-----
>>>
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: Mailvelope v1.2.3
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWPVZPCRDmVDuy+mK58QAAyK4QAL4ZdF0bRxSVSQAZGgDN
>>> pEfGEO1+heaj5Uj1sUitoXct5f//TbXcnuJDStlMe0rbplZDPUU0ZsXs8hNE
>>> sro6GiFuSP6ZQgHshW50d8iCGjmF/DKhYPs6jWJUIwCMelY45YLfpadAmkZT
>>> GePGEu5UzhYhlfQeiaQOFd7jWH2uVOnPLASK6f68cNRUv8rywJ8q5/6h0p8I
>>> TPg277NglGP1VntZ0z4/9CsSl49YOowVQooRZ9JQr3BpFYsbSEBBY5vLak8q
>>> X9Rb0rngG52vKT5VE58wUY/Pfbdwn7nbnV/BOUBnhBr+f14QKhNsWKpVM9EV
>>> R/cjlqJV3vesrwrXWay+4AaVoOn1TPMgBc/YV9LOlSdectNC0Ig7iBqC0Mjo
>>> kgeSQ0NJZSN99o4GKUnfwnd/fjDLzyi03XX5JkUMmEDLKPjT0LTmcnVSP5gu
>>> GGdEDNNEfIyt8PZalB4HN1Ik0c4/YdQKpb6XjbejoN37NvYom+dwZsKk2g/J
>>> Qa1bFDzvUZoTfax1yyMh2xu4b0rI6+a3bBhVBbY6Wz417aPRAhz09DecJoxt
>>> 28jqn3Aj7ARETg5BTCn1gGjEWP4IytLKOvctukCFSnxJWKPumTMRqfTUnsKu
>>> FxNjhSk5Kc+kVV7wQ7cU6NzxoBYHXMoEeamFXBmLooUG4lDKEeg0t+R9hPbT
>>> ABCA
>>> =yXJO
>>> -----END PGP SIGNATURE-----
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-09 17:19                   ` Robert LeBlanc
@ 2015-11-09 18:19                     ` Samuel Just
  2015-11-09 18:55                       ` Haomai Wang
  2015-11-09 19:19                       ` Robert LeBlanc
  0 siblings, 2 replies; 24+ messages in thread
From: Samuel Just @ 2015-11-09 18:19 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: Sage Weil, ceph-devel

Ops are hashed from the messenger (or any of the other enqueue sources
for non-message items) into one of N queues, each of which is serviced
by M threads.  We can't quite have a single thread own a single queue
yet because the current design allows multiple threads/queue
(important because if a sync read blocks on one thread, other threads
working on that queue can continue to make progress).  However, the
queue contents are hashed to a queue based on the PG, so if a PG
queues work, it'll be on the same queue as it is already operating
from (which I think is what you are getting at?).  I'm moving away
from that with the async read work I'm doing (ceph-devel subject
"Async reads, sync writes, op thread model discussion"), but I'll
still need a replacement for PrioritizedQueue.
-Sam

On Mon, Nov 9, 2015 at 9:19 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> I should probably work against this branch.
>
> I've got some more reading of code to do, but I'm thinking that there
> isn't one of these queues for each OSD, it seems like there is one
> queue for each thread in the OSD. If this is true, I think it makes
> sense to break the queue into it's own thread and have each 'worker'
> thread push and pop OPs out of that thread. I have been focused on the
> Queue code that I haven't really looked at the OSD/PG code until last
> Friday and it is like trying to drink from a fire hose going through
> that code, so I may be misunderstanding something.
>
> I'd appreciate any pointers to quickly understanding the OSD/PG code
> specifically around the OPs and the queue.
>
> Thanks,
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWQNWzCRDmVDuy+mK58QAAAGAQAJ44uFZNl84eGrHIMzDc
> EyMBCE/STAOtZINV0DRmnKqrKLeWZ2ajHhr7gYdXByMdCi9QTnz/pYH8fP4m
> sTtf8MnaEdDuFYpc+kVP4sOZx+efF64s4isN8lDpoa6noqDR68W3xJ7MV9/l
> WJizoD9LWOvPVdPlO6M1jw3waL1eZMrxzPGpz2Xws4XnyGjIWeoUWl0kZYyT
> EwGNGaQXBsioowd2PySc3axAY/zaeaJFPp4trw2k2sE9Yi4NT39R3tWgljkC
> Ras8TjfHml1+xPeVadB4fdbYl2TaR8xYsVWCp+k1IuiEk/CAeljMjfST/Dqf
> TBMhhw8h24AP1GLPwiOFdGIh6h6gj0UoXeXsfHKhSuW6M8Ur+9fuynyuhBUV
> V0707nVmu9eiBwkgDHBcIRlnMQ0dDH60Uubf6ShagwjQSg6yfh6MNHVt6FFv
> PJCcGDfEqzCjbcGhRyG0bE4aAXXAlHnUy4y2VRGIodmTHqUcZAfXoQd3dklC
> KdSNyY+z/inOZip1Pbal4jNv3jAJBABn6Y1nNuB3W+33s/Jvt/aQbJpwYlkQ
> iivTMkoMsimVNKAhoTybZpVwJ2Hy5TL/tWqDNwg3TBXtWSFU5S1XgJzoAQm5
> yE7dbMwhAObw3XQ/eGMTmyICs1vwD0+mxaNHHWzSubtFKcdblUDW6BUxc+lj
> ztfA
> =GSDL
> -----END PGP SIGNATURE-----
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Mon, Nov 9, 2015 at 9:49 AM, Samuel Just <sjust@redhat.com> wrote:
>> It's partially in the unified queue.  The primary's background work
>> for kicking off a recovery operation is not in the unified queue, but
>> the messages to the replicas (pushes, pull, backfill scans) as well as
>> their replies are in the unified queue as normal messages.  I've got a
>> branch moving the primary's work to the queue as well (didn't quite
>> make infernalis) --
>> https://github.com/athanatos/ceph/tree/wip-recovery-wq.  I'm trying to
>> stabilize it now for merge that infernalis is out.
>> -Sam
>>
>> On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil <sage@newdream.net> wrote:
>>> On Fri, 6 Nov 2015, Robert LeBlanc wrote:
>>>
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA256
>>>>
>>>> After trying to look through the recovery code, I'm getting the
>>>> feeling that recovery OPs are not scheduled in the OP queue that I've
>>>> been working on. Does that sound right? In the OSD logs I'm only
>>>> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
>>>> If the recovery is in another separate queue, then there is no
>>>> reliable way to prioritize OPs between them.
>>>>
>>>> If I'm going off in to the weeds, please help me get back on the trail.
>>>
>>> Yeah, the recovery work isn't in the unified queue yet.
>>>
>>> sage
>>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> - ----------------
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>
>>>>
>>>> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
>>>> > -----BEGIN PGP SIGNED MESSAGE-----
>>>> > Hash: SHA256
>>>> >
>>>> > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
>>>> >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
>>>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> >>> Hash: SHA256
>>>> >>>
>>>> >>> Thanks Gregory,
>>>> >>>
>>>> >>> People are most likely busy and haven't had time to digest this and I
>>>> >>> may be expecting more excitement from it (I'm excited due to the
>>>> >>> results and probably also that such a large change still works). I'll
>>>> >>> keep working towards a PR, this was mostly proof of concept, now that
>>>> >>> there is some data I'll clean up the code.
>>>> >>
>>>> >> I'm *very* excited about this.  This is something that almost every
>>>> >> operator has problems with so it's very encouraging to see that switching
>>>> >> up the queue has a big impact in your environment.
>>>> >>
>>>> >> I'm just following up on this after a week of travel, so apologies if this
>>>> >> is covered already, but did you compare this implementation to the
>>>> >> original one with the same tunables?  I see somewhere that you had
>>>> >> max_backfills=20 at some point, which is going to be bad regardless of the
>>>> >> queue.
>>>> >>
>>>> >> I also see that you chnaged the strict priority threshold from LOW to HIGH
>>>> >> in OSD.cc; I'm curious how much of an impact was from this vs the queue
>>>> >> implementation.
>>>> >
>>>> > Yes max_backfills=20 is problematic for both queues and from what I
>>>> > can tell is because the OPs are waiting for PGs to get healthy. In a
>>>> > busy cluster it can take a while due to the recovery ops having low
>>>> > priority. In the current queue, it is possible to be blocked for a
>>>> > long time. The new queue seems to prevent that, but they do still back
>>>> > up. After this, I think I'd like to look into promoting recovery OPs
>>>> > that are blocking client OPs to higher priorities so that client I/O
>>>> > doesn't suffer as much during recovery. I think that will be a very
>>>> > different problem to tackle because I don't think I can do the proper
>>>> > introspection at the queue level. I'll have to do that logic in OSD.cc
>>>> > or PG.cc.
>>>> >
>>>> > The strict priority threshold didn't make much of a difference with
>>>> > the original queue. I initially eliminated it all together in the WRR,
>>>> > but there were times that peering would never complete. I want to get
>>>> > as many OPs in the WRR queue to provide fairness as much as possible.
>>>> > I haven't tweaked the setting much in the WRR queue yet.
>>>> >
>>>> >>
>>>> >>> I was thinking that a config option to choose the scheduler would be a
>>>> >>> good idea. In terms of the project what is the better approach: create
>>>> >>> a new template and each place the template class is instantiated
>>>> >>> select the queue, or perform the queue selection in the same template
>>>> >>> class, or something else I haven't thought of.
>>>> >>
>>>> >> A config option would be nice, but I'd start by just cleaning up the code
>>>> >> and putting it in a new class (WeightedRoundRobinPriorityQueue or
>>>> >> whatever).  If we find that it's behaving better I'm not sure how much
>>>> >> value we get from a tunable.  Note that there is one other user
>>>> >> (msgr/simple/DispatchQueue) that we might also was to switch over at some
>>>> >> point.. especially if this implementation is faster.
>>>> >>
>>>> >> Once it's cleaned up (remove commented out code, new class) put it up as a
>>>> >> PR and we can review and get it through testing.
>>>> >
>>>> > In talking with Samuel in IRC, we think creating an abstract class for
>>>> > the queue is the best option. C++11 allows you to still optimize
>>>> > abstract template classes if you use final in the derived class (I
>>>> > verified the assembly). I'm planning to refactor the code so that
>>>> > similar code can be reused between queues and allows more flexibility
>>>> > in the future (components can chose the queue that works the best for
>>>> > them, etc). The test for which queue to use should be a very simple
>>>> > comparison and it would allow us to let it bake much longer. I hope to
>>>> > have a PR mid next week.
>>>> >
>>>> > - ----------------
>>>> > Robert LeBlanc
>>>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>> >
>>>> > -----BEGIN PGP SIGNATURE-----
>>>> > Version: Mailvelope v1.2.3
>>>> > Comment: https://www.mailvelope.com
>>>> >
>>>> > wsFcBAEBCAAQBQJWPN1xCRDmVDuy+mK58QAA2XwP/1bv4DUVTfoAGU8q6RDK
>>>> > xXCcqNoy2rFcG/D4wipnnGrjMYnVlH33l73hyaZiSQzMwvfzBAl5igQbIlAh
>>>> > 41yqXOaGxk+BYRXRNHL5KCP0p0esjV8Wv1z9X2yfKdWeHbwueOKju5ljDQ6X
>>>> > AaVXefw1fdag8JEvSjh0dsjgh8wf3G+lAcC9GHB/PFNHXYsl1BVOUz1REnno
>>>> > v5vIAZz+iySb8vVrWXJUBaPdW9aao/sqJFU2ZHBziWgeIZ9OlrTlhr9znsxy
>>>> > aDa18suMC8vhcrZjyAgKlSbxhgynWh7R2RjxFA5ZObBEsdbztJfg9ibyDzKG
>>>> > Ngpe+jVXGTM03z4ohajzPPJ0tzj03XpGc45yXzj6Q4NHOlp5CPdzAPgmxQkz
>>>> > ot5cAIR83z67PBIkemeiBQvbC4/ToVCXIBCfEPVW5Yu6grnTd4+AAKxTakip
>>>> > +tXSai03MNMlNBeaBnooZ/li7s9VMSluXheZ2JNs9ssRTZkGQH3Pof3p3Y5t
>>>> > pAb7qeRlxm+t+i1rZ1tn1FtF/YAx4DKGvyFz4Pzk8pe77jZ+nQLMtoOJJgGJ
>>>> > w/+TGiegnUPt6pqWf/Z5o6+GB8SiM/5zKr+Xkm8aIcju/Fq0qy3fx96z81Cv
>>>> > QC25ZklTblVt1ImSG30qoVcZdqWKTMwnJhpFNj8GVbzyV5EoFh4T0YBmu3fm
>>>> > FKe/
>>>> > =yodk
>>>> > -----END PGP SIGNATURE-----
>>>>
>>>> -----BEGIN PGP SIGNATURE-----
>>>> Version: Mailvelope v1.2.3
>>>> Comment: https://www.mailvelope.com
>>>>
>>>> wsFcBAEBCAAQBQJWPVZPCRDmVDuy+mK58QAAyK4QAL4ZdF0bRxSVSQAZGgDN
>>>> pEfGEO1+heaj5Uj1sUitoXct5f//TbXcnuJDStlMe0rbplZDPUU0ZsXs8hNE
>>>> sro6GiFuSP6ZQgHshW50d8iCGjmF/DKhYPs6jWJUIwCMelY45YLfpadAmkZT
>>>> GePGEu5UzhYhlfQeiaQOFd7jWH2uVOnPLASK6f68cNRUv8rywJ8q5/6h0p8I
>>>> TPg277NglGP1VntZ0z4/9CsSl49YOowVQooRZ9JQr3BpFYsbSEBBY5vLak8q
>>>> X9Rb0rngG52vKT5VE58wUY/Pfbdwn7nbnV/BOUBnhBr+f14QKhNsWKpVM9EV
>>>> R/cjlqJV3vesrwrXWay+4AaVoOn1TPMgBc/YV9LOlSdectNC0Ig7iBqC0Mjo
>>>> kgeSQ0NJZSN99o4GKUnfwnd/fjDLzyi03XX5JkUMmEDLKPjT0LTmcnVSP5gu
>>>> GGdEDNNEfIyt8PZalB4HN1Ik0c4/YdQKpb6XjbejoN37NvYom+dwZsKk2g/J
>>>> Qa1bFDzvUZoTfax1yyMh2xu4b0rI6+a3bBhVBbY6Wz417aPRAhz09DecJoxt
>>>> 28jqn3Aj7ARETg5BTCn1gGjEWP4IytLKOvctukCFSnxJWKPumTMRqfTUnsKu
>>>> FxNjhSk5Kc+kVV7wQ7cU6NzxoBYHXMoEeamFXBmLooUG4lDKEeg0t+R9hPbT
>>>> ABCA
>>>> =yXJO
>>>> -----END PGP SIGNATURE-----
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-09 18:19                     ` Samuel Just
@ 2015-11-09 18:55                       ` Haomai Wang
  2015-11-09 19:19                       ` Robert LeBlanc
  1 sibling, 0 replies; 24+ messages in thread
From: Haomai Wang @ 2015-11-09 18:55 UTC (permalink / raw)
  To: Samuel Just; +Cc: Robert LeBlanc, Sage Weil, ceph-devel

On Tue, Nov 10, 2015 at 2:19 AM, Samuel Just <sjust@redhat.com> wrote:
> Ops are hashed from the messenger (or any of the other enqueue sources
> for non-message items) into one of N queues, each of which is serviced
> by M threads.  We can't quite have a single thread own a single queue
> yet because the current design allows multiple threads/queue
> (important because if a sync read blocks on one thread, other threads
> working on that queue can continue to make progress).  However, the
> queue contents are hashed to a queue based on the PG, so if a PG
> queues work, it'll be on the same queue as it is already operating
> from (which I think is what you are getting at?).  I'm moving away
> from that with the async read work I'm doing (ceph-devel subject
> "Async reads, sync writes, op thread model discussion"), but I'll
> still need a replacement for PrioritizedQueue.

I don't think clearly about the idea that we make PrioriryQueue(or
whatever WeightBased) client-oriented. Because currently each
connection owned by a async messenger thread, if latter queue is pg
oriented, huge lock contention can't be avoided with iops increasing.

The only way I guess is make msgr thread -> osd thread via the same
hash key(or whatever we can make the two threads paired). What's more,
msgr thread could use the same way as sam's branch, it could be only
one thread.

> -Sam
>
> On Mon, Nov 9, 2015 at 9:19 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> I should probably work against this branch.
>>
>> I've got some more reading of code to do, but I'm thinking that there
>> isn't one of these queues for each OSD, it seems like there is one
>> queue for each thread in the OSD. If this is true, I think it makes
>> sense to break the queue into it's own thread and have each 'worker'
>> thread push and pop OPs out of that thread. I have been focused on the
>> Queue code that I haven't really looked at the OSD/PG code until last
>> Friday and it is like trying to drink from a fire hose going through
>> that code, so I may be misunderstanding something.
>>
>> I'd appreciate any pointers to quickly understanding the OSD/PG code
>> specifically around the OPs and the queue.
>>
>> Thanks,
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.3
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWQNWzCRDmVDuy+mK58QAAAGAQAJ44uFZNl84eGrHIMzDc
>> EyMBCE/STAOtZINV0DRmnKqrKLeWZ2ajHhr7gYdXByMdCi9QTnz/pYH8fP4m
>> sTtf8MnaEdDuFYpc+kVP4sOZx+efF64s4isN8lDpoa6noqDR68W3xJ7MV9/l
>> WJizoD9LWOvPVdPlO6M1jw3waL1eZMrxzPGpz2Xws4XnyGjIWeoUWl0kZYyT
>> EwGNGaQXBsioowd2PySc3axAY/zaeaJFPp4trw2k2sE9Yi4NT39R3tWgljkC
>> Ras8TjfHml1+xPeVadB4fdbYl2TaR8xYsVWCp+k1IuiEk/CAeljMjfST/Dqf
>> TBMhhw8h24AP1GLPwiOFdGIh6h6gj0UoXeXsfHKhSuW6M8Ur+9fuynyuhBUV
>> V0707nVmu9eiBwkgDHBcIRlnMQ0dDH60Uubf6ShagwjQSg6yfh6MNHVt6FFv
>> PJCcGDfEqzCjbcGhRyG0bE4aAXXAlHnUy4y2VRGIodmTHqUcZAfXoQd3dklC
>> KdSNyY+z/inOZip1Pbal4jNv3jAJBABn6Y1nNuB3W+33s/Jvt/aQbJpwYlkQ
>> iivTMkoMsimVNKAhoTybZpVwJ2Hy5TL/tWqDNwg3TBXtWSFU5S1XgJzoAQm5
>> yE7dbMwhAObw3XQ/eGMTmyICs1vwD0+mxaNHHWzSubtFKcdblUDW6BUxc+lj
>> ztfA
>> =GSDL
>> -----END PGP SIGNATURE-----
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Mon, Nov 9, 2015 at 9:49 AM, Samuel Just <sjust@redhat.com> wrote:
>>> It's partially in the unified queue.  The primary's background work
>>> for kicking off a recovery operation is not in the unified queue, but
>>> the messages to the replicas (pushes, pull, backfill scans) as well as
>>> their replies are in the unified queue as normal messages.  I've got a
>>> branch moving the primary's work to the queue as well (didn't quite
>>> make infernalis) --
>>> https://github.com/athanatos/ceph/tree/wip-recovery-wq.  I'm trying to
>>> stabilize it now for merge that infernalis is out.
>>> -Sam
>>>
>>> On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil <sage@newdream.net> wrote:
>>>> On Fri, 6 Nov 2015, Robert LeBlanc wrote:
>>>>
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA256
>>>>>
>>>>> After trying to look through the recovery code, I'm getting the
>>>>> feeling that recovery OPs are not scheduled in the OP queue that I've
>>>>> been working on. Does that sound right? In the OSD logs I'm only
>>>>> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
>>>>> If the recovery is in another separate queue, then there is no
>>>>> reliable way to prioritize OPs between them.
>>>>>
>>>>> If I'm going off in to the weeds, please help me get back on the trail.
>>>>
>>>> Yeah, the recovery work isn't in the unified queue yet.
>>>>
>>>> sage
>>>>
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> - ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>
>>>>>
>>>>> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
>>>>> > -----BEGIN PGP SIGNED MESSAGE-----
>>>>> > Hash: SHA256
>>>>> >
>>>>> > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
>>>>> >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
>>>>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> >>> Hash: SHA256
>>>>> >>>
>>>>> >>> Thanks Gregory,
>>>>> >>>
>>>>> >>> People are most likely busy and haven't had time to digest this and I
>>>>> >>> may be expecting more excitement from it (I'm excited due to the
>>>>> >>> results and probably also that such a large change still works). I'll
>>>>> >>> keep working towards a PR, this was mostly proof of concept, now that
>>>>> >>> there is some data I'll clean up the code.
>>>>> >>
>>>>> >> I'm *very* excited about this.  This is something that almost every
>>>>> >> operator has problems with so it's very encouraging to see that switching
>>>>> >> up the queue has a big impact in your environment.
>>>>> >>
>>>>> >> I'm just following up on this after a week of travel, so apologies if this
>>>>> >> is covered already, but did you compare this implementation to the
>>>>> >> original one with the same tunables?  I see somewhere that you had
>>>>> >> max_backfills=20 at some point, which is going to be bad regardless of the
>>>>> >> queue.
>>>>> >>
>>>>> >> I also see that you chnaged the strict priority threshold from LOW to HIGH
>>>>> >> in OSD.cc; I'm curious how much of an impact was from this vs the queue
>>>>> >> implementation.
>>>>> >
>>>>> > Yes max_backfills=20 is problematic for both queues and from what I
>>>>> > can tell is because the OPs are waiting for PGs to get healthy. In a
>>>>> > busy cluster it can take a while due to the recovery ops having low
>>>>> > priority. In the current queue, it is possible to be blocked for a
>>>>> > long time. The new queue seems to prevent that, but they do still back
>>>>> > up. After this, I think I'd like to look into promoting recovery OPs
>>>>> > that are blocking client OPs to higher priorities so that client I/O
>>>>> > doesn't suffer as much during recovery. I think that will be a very
>>>>> > different problem to tackle because I don't think I can do the proper
>>>>> > introspection at the queue level. I'll have to do that logic in OSD.cc
>>>>> > or PG.cc.
>>>>> >
>>>>> > The strict priority threshold didn't make much of a difference with
>>>>> > the original queue. I initially eliminated it all together in the WRR,
>>>>> > but there were times that peering would never complete. I want to get
>>>>> > as many OPs in the WRR queue to provide fairness as much as possible.
>>>>> > I haven't tweaked the setting much in the WRR queue yet.
>>>>> >
>>>>> >>
>>>>> >>> I was thinking that a config option to choose the scheduler would be a
>>>>> >>> good idea. In terms of the project what is the better approach: create
>>>>> >>> a new template and each place the template class is instantiated
>>>>> >>> select the queue, or perform the queue selection in the same template
>>>>> >>> class, or something else I haven't thought of.
>>>>> >>
>>>>> >> A config option would be nice, but I'd start by just cleaning up the code
>>>>> >> and putting it in a new class (WeightedRoundRobinPriorityQueue or
>>>>> >> whatever).  If we find that it's behaving better I'm not sure how much
>>>>> >> value we get from a tunable.  Note that there is one other user
>>>>> >> (msgr/simple/DispatchQueue) that we might also was to switch over at some
>>>>> >> point.. especially if this implementation is faster.
>>>>> >>
>>>>> >> Once it's cleaned up (remove commented out code, new class) put it up as a
>>>>> >> PR and we can review and get it through testing.
>>>>> >
>>>>> > In talking with Samuel in IRC, we think creating an abstract class for
>>>>> > the queue is the best option. C++11 allows you to still optimize
>>>>> > abstract template classes if you use final in the derived class (I
>>>>> > verified the assembly). I'm planning to refactor the code so that
>>>>> > similar code can be reused between queues and allows more flexibility
>>>>> > in the future (components can chose the queue that works the best for
>>>>> > them, etc). The test for which queue to use should be a very simple
>>>>> > comparison and it would allow us to let it bake much longer. I hope to
>>>>> > have a PR mid next week.
>>>>> >
>>>>> > - ----------------
>>>>> > Robert LeBlanc
>>>>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> >
>>>>> > -----BEGIN PGP SIGNATURE-----
>>>>> > Version: Mailvelope v1.2.3
>>>>> > Comment: https://www.mailvelope.com
>>>>> >
>>>>> > wsFcBAEBCAAQBQJWPN1xCRDmVDuy+mK58QAA2XwP/1bv4DUVTfoAGU8q6RDK
>>>>> > xXCcqNoy2rFcG/D4wipnnGrjMYnVlH33l73hyaZiSQzMwvfzBAl5igQbIlAh
>>>>> > 41yqXOaGxk+BYRXRNHL5KCP0p0esjV8Wv1z9X2yfKdWeHbwueOKju5ljDQ6X
>>>>> > AaVXefw1fdag8JEvSjh0dsjgh8wf3G+lAcC9GHB/PFNHXYsl1BVOUz1REnno
>>>>> > v5vIAZz+iySb8vVrWXJUBaPdW9aao/sqJFU2ZHBziWgeIZ9OlrTlhr9znsxy
>>>>> > aDa18suMC8vhcrZjyAgKlSbxhgynWh7R2RjxFA5ZObBEsdbztJfg9ibyDzKG
>>>>> > Ngpe+jVXGTM03z4ohajzPPJ0tzj03XpGc45yXzj6Q4NHOlp5CPdzAPgmxQkz
>>>>> > ot5cAIR83z67PBIkemeiBQvbC4/ToVCXIBCfEPVW5Yu6grnTd4+AAKxTakip
>>>>> > +tXSai03MNMlNBeaBnooZ/li7s9VMSluXheZ2JNs9ssRTZkGQH3Pof3p3Y5t
>>>>> > pAb7qeRlxm+t+i1rZ1tn1FtF/YAx4DKGvyFz4Pzk8pe77jZ+nQLMtoOJJgGJ
>>>>> > w/+TGiegnUPt6pqWf/Z5o6+GB8SiM/5zKr+Xkm8aIcju/Fq0qy3fx96z81Cv
>>>>> > QC25ZklTblVt1ImSG30qoVcZdqWKTMwnJhpFNj8GVbzyV5EoFh4T0YBmu3fm
>>>>> > FKe/
>>>>> > =yodk
>>>>> > -----END PGP SIGNATURE-----
>>>>>
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: Mailvelope v1.2.3
>>>>> Comment: https://www.mailvelope.com
>>>>>
>>>>> wsFcBAEBCAAQBQJWPVZPCRDmVDuy+mK58QAAyK4QAL4ZdF0bRxSVSQAZGgDN
>>>>> pEfGEO1+heaj5Uj1sUitoXct5f//TbXcnuJDStlMe0rbplZDPUU0ZsXs8hNE
>>>>> sro6GiFuSP6ZQgHshW50d8iCGjmF/DKhYPs6jWJUIwCMelY45YLfpadAmkZT
>>>>> GePGEu5UzhYhlfQeiaQOFd7jWH2uVOnPLASK6f68cNRUv8rywJ8q5/6h0p8I
>>>>> TPg277NglGP1VntZ0z4/9CsSl49YOowVQooRZ9JQr3BpFYsbSEBBY5vLak8q
>>>>> X9Rb0rngG52vKT5VE58wUY/Pfbdwn7nbnV/BOUBnhBr+f14QKhNsWKpVM9EV
>>>>> R/cjlqJV3vesrwrXWay+4AaVoOn1TPMgBc/YV9LOlSdectNC0Ig7iBqC0Mjo
>>>>> kgeSQ0NJZSN99o4GKUnfwnd/fjDLzyi03XX5JkUMmEDLKPjT0LTmcnVSP5gu
>>>>> GGdEDNNEfIyt8PZalB4HN1Ik0c4/YdQKpb6XjbejoN37NvYom+dwZsKk2g/J
>>>>> Qa1bFDzvUZoTfax1yyMh2xu4b0rI6+a3bBhVBbY6Wz417aPRAhz09DecJoxt
>>>>> 28jqn3Aj7ARETg5BTCn1gGjEWP4IytLKOvctukCFSnxJWKPumTMRqfTUnsKu
>>>>> FxNjhSk5Kc+kVV7wQ7cU6NzxoBYHXMoEeamFXBmLooUG4lDKEeg0t+R9hPbT
>>>>> ABCA
>>>>> =yXJO
>>>>> -----END PGP SIGNATURE-----
>>>>>
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-09 18:19                     ` Samuel Just
  2015-11-09 18:55                       ` Haomai Wang
@ 2015-11-09 19:19                       ` Robert LeBlanc
  2015-11-09 19:47                         ` Samuel Just
  1 sibling, 1 reply; 24+ messages in thread
From: Robert LeBlanc @ 2015-11-09 19:19 UTC (permalink / raw)
  To: Samuel Just; +Cc: Sage Weil, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Thanks, I think some of the fog is clearing. I was wondering how
operations between threads were keeping the order of operations in
PGs, that explains it.

My original thoughts were to have a queue in front and behind the
Prio/WRR queue. Threads scheduling work would queue to the pre-queue.
The queue thread would pull ops off that queue and place them into the
specialized queue, do house keeping, etc and would dequeue ops in that
queue to a post-queue that worker threads would monitor. The thread
queue could keep a certain amount of items in the post-queue to
prevent starvation and worker threads from being blocked.

It would require the worker thread to be able to handle any kind of
op, or having separate post-queues for the different kinds of work.
I'm getting the feeling that this may be a far too simplistic approach
to the problem (or at least in terms of the organization of Ceph at
this point). I'm also starting to feel that I'm getting out of my
league trying to understand all the intricacies of the OSD work flow
(trying to start with one of the most complicated parts of the system
doesn't help).

Maybe what I should do is just code up the queue to drop in as a
replacement for the Prio queue for the moment. Then as your async work
is completing we can shake out the potential issues with recovery and
costs that we talked about earlier. One thing that I'd like to look
into is elevating the priority of recovery ops that have client OPs
blocked. I don't think the WRR queue gives the recovery thread a lot
of time to get its work done.

Based on some testing on Friday, the number of recovery ops on an osd
did not really change if there were 20 backfilling or 1 backfilling.
The difference came in with how many client I/Os were blocked waiting
for objects to recover. When 20 backfills were going, there were a lot
more blocked I/O waiting for objects to show up or recover. With one
backfill, there were far less blocked I/O, but there were still times
I/O would block.
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWQPHBCRDmVDuy+mK58QAA72EQAMgzgrw3OAvBi1/NmuWl
LXGM0qGz3hE/p5oUsnqcnz2/+VYP3FZRanszyuU8+vKCwj+I/Ny9Olm1JAnw
DSE7PvhuO6J5w0ymOIccKdX7uk2QZyP8ggO1D5fLC2M9/xqQQSZrAPE7vc4j
O9HHuZsMF+ABUKU5RVCjn1ax+y2LhpetxH3nu37xpSKPDPFiowVnW8YlBGJy
Cf1FYMVDLv60F5EmjstOn4FhSXC/+DuSATwP+CmNEPZ3JNTBgtPuU/22/De3
M4ZdDzeylVWYB66vbL9ijLeZDoCaxKgFL+QwUAswefaDBD1citCU2v7/7VQP
aChnSzI8BYG0bHg5u7QEohzQyJUCC1OubiRkbUmOOeCiBI0Lqv3jf321T4ss
PD3hqkagyhRe67zPB6bhhik0ZDOYHTAyV/ceAae4VDJTgu+/gI8Gc1c3mp5g
nZL5z7hVohZ0AvfdEzasRhTnTcH6TfO9lpqU2nyMAc76SoPyDSTmAcMVt0tj
/1BQAnk/I5rlCL5CKTxb2LR1/5WJt0eh7xtyKU1B0yh4G7JlMf/3kmrznOWu
VEUUA3mJ1depDToadnECnCZMKHrGYC36XCy8xq3FDqhvl4BWV0VMA+yi1uhj
zZ5udKKbN5Cxo/Sc48DG8wz9lQKn4LPCH2PD81oTcTfyd1iG2oNNkchrXa6K
iwed
=WjDS
-----END PGP SIGNATURE-----
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Nov 9, 2015 at 11:19 AM, Samuel Just <sjust@redhat.com> wrote:
> Ops are hashed from the messenger (or any of the other enqueue sources
> for non-message items) into one of N queues, each of which is serviced
> by M threads.  We can't quite have a single thread own a single queue
> yet because the current design allows multiple threads/queue
> (important because if a sync read blocks on one thread, other threads
> working on that queue can continue to make progress).  However, the
> queue contents are hashed to a queue based on the PG, so if a PG
> queues work, it'll be on the same queue as it is already operating
> from (which I think is what you are getting at?).  I'm moving away
> from that with the async read work I'm doing (ceph-devel subject
> "Async reads, sync writes, op thread model discussion"), but I'll
> still need a replacement for PrioritizedQueue.
> -Sam
>
> On Mon, Nov 9, 2015 at 9:19 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> I should probably work against this branch.
>>
>> I've got some more reading of code to do, but I'm thinking that there
>> isn't one of these queues for each OSD, it seems like there is one
>> queue for each thread in the OSD. If this is true, I think it makes
>> sense to break the queue into it's own thread and have each 'worker'
>> thread push and pop OPs out of that thread. I have been focused on the
>> Queue code that I haven't really looked at the OSD/PG code until last
>> Friday and it is like trying to drink from a fire hose going through
>> that code, so I may be misunderstanding something.
>>
>> I'd appreciate any pointers to quickly understanding the OSD/PG code
>> specifically around the OPs and the queue.
>>
>> Thanks,
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.3
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWQNWzCRDmVDuy+mK58QAAAGAQAJ44uFZNl84eGrHIMzDc
>> EyMBCE/STAOtZINV0DRmnKqrKLeWZ2ajHhr7gYdXByMdCi9QTnz/pYH8fP4m
>> sTtf8MnaEdDuFYpc+kVP4sOZx+efF64s4isN8lDpoa6noqDR68W3xJ7MV9/l
>> WJizoD9LWOvPVdPlO6M1jw3waL1eZMrxzPGpz2Xws4XnyGjIWeoUWl0kZYyT
>> EwGNGaQXBsioowd2PySc3axAY/zaeaJFPp4trw2k2sE9Yi4NT39R3tWgljkC
>> Ras8TjfHml1+xPeVadB4fdbYl2TaR8xYsVWCp+k1IuiEk/CAeljMjfST/Dqf
>> TBMhhw8h24AP1GLPwiOFdGIh6h6gj0UoXeXsfHKhSuW6M8Ur+9fuynyuhBUV
>> V0707nVmu9eiBwkgDHBcIRlnMQ0dDH60Uubf6ShagwjQSg6yfh6MNHVt6FFv
>> PJCcGDfEqzCjbcGhRyG0bE4aAXXAlHnUy4y2VRGIodmTHqUcZAfXoQd3dklC
>> KdSNyY+z/inOZip1Pbal4jNv3jAJBABn6Y1nNuB3W+33s/Jvt/aQbJpwYlkQ
>> iivTMkoMsimVNKAhoTybZpVwJ2Hy5TL/tWqDNwg3TBXtWSFU5S1XgJzoAQm5
>> yE7dbMwhAObw3XQ/eGMTmyICs1vwD0+mxaNHHWzSubtFKcdblUDW6BUxc+lj
>> ztfA
>> =GSDL
>> -----END PGP SIGNATURE-----
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Mon, Nov 9, 2015 at 9:49 AM, Samuel Just <sjust@redhat.com> wrote:
>>> It's partially in the unified queue.  The primary's background work
>>> for kicking off a recovery operation is not in the unified queue, but
>>> the messages to the replicas (pushes, pull, backfill scans) as well as
>>> their replies are in the unified queue as normal messages.  I've got a
>>> branch moving the primary's work to the queue as well (didn't quite
>>> make infernalis) --
>>> https://github.com/athanatos/ceph/tree/wip-recovery-wq.  I'm trying to
>>> stabilize it now for merge that infernalis is out.
>>> -Sam
>>>
>>> On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil <sage@newdream.net> wrote:
>>>> On Fri, 6 Nov 2015, Robert LeBlanc wrote:
>>>>
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA256
>>>>>
>>>>> After trying to look through the recovery code, I'm getting the
>>>>> feeling that recovery OPs are not scheduled in the OP queue that I've
>>>>> been working on. Does that sound right? In the OSD logs I'm only
>>>>> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
>>>>> If the recovery is in another separate queue, then there is no
>>>>> reliable way to prioritize OPs between them.
>>>>>
>>>>> If I'm going off in to the weeds, please help me get back on the trail.
>>>>
>>>> Yeah, the recovery work isn't in the unified queue yet.
>>>>
>>>> sage
>>>>
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> - ----------------
>>>>> Robert LeBlanc
>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>
>>>>>
>>>>> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
>>>>> > -----BEGIN PGP SIGNED MESSAGE-----
>>>>> > Hash: SHA256
>>>>> >
>>>>> > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
>>>>> >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
>>>>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> >>> Hash: SHA256
>>>>> >>>
>>>>> >>> Thanks Gregory,
>>>>> >>>
>>>>> >>> People are most likely busy and haven't had time to digest this and I
>>>>> >>> may be expecting more excitement from it (I'm excited due to the
>>>>> >>> results and probably also that such a large change still works). I'll
>>>>> >>> keep working towards a PR, this was mostly proof of concept, now that
>>>>> >>> there is some data I'll clean up the code.
>>>>> >>
>>>>> >> I'm *very* excited about this.  This is something that almost every
>>>>> >> operator has problems with so it's very encouraging to see that switching
>>>>> >> up the queue has a big impact in your environment.
>>>>> >>
>>>>> >> I'm just following up on this after a week of travel, so apologies if this
>>>>> >> is covered already, but did you compare this implementation to the
>>>>> >> original one with the same tunables?  I see somewhere that you had
>>>>> >> max_backfills=20 at some point, which is going to be bad regardless of the
>>>>> >> queue.
>>>>> >>
>>>>> >> I also see that you chnaged the strict priority threshold from LOW to HIGH
>>>>> >> in OSD.cc; I'm curious how much of an impact was from this vs the queue
>>>>> >> implementation.
>>>>> >
>>>>> > Yes max_backfills=20 is problematic for both queues and from what I
>>>>> > can tell is because the OPs are waiting for PGs to get healthy. In a
>>>>> > busy cluster it can take a while due to the recovery ops having low
>>>>> > priority. In the current queue, it is possible to be blocked for a
>>>>> > long time. The new queue seems to prevent that, but they do still back
>>>>> > up. After this, I think I'd like to look into promoting recovery OPs
>>>>> > that are blocking client OPs to higher priorities so that client I/O
>>>>> > doesn't suffer as much during recovery. I think that will be a very
>>>>> > different problem to tackle because I don't think I can do the proper
>>>>> > introspection at the queue level. I'll have to do that logic in OSD.cc
>>>>> > or PG.cc.
>>>>> >
>>>>> > The strict priority threshold didn't make much of a difference with
>>>>> > the original queue. I initially eliminated it all together in the WRR,
>>>>> > but there were times that peering would never complete. I want to get
>>>>> > as many OPs in the WRR queue to provide fairness as much as possible.
>>>>> > I haven't tweaked the setting much in the WRR queue yet.
>>>>> >
>>>>> >>
>>>>> >>> I was thinking that a config option to choose the scheduler would be a
>>>>> >>> good idea. In terms of the project what is the better approach: create
>>>>> >>> a new template and each place the template class is instantiated
>>>>> >>> select the queue, or perform the queue selection in the same template
>>>>> >>> class, or something else I haven't thought of.
>>>>> >>
>>>>> >> A config option would be nice, but I'd start by just cleaning up the code
>>>>> >> and putting it in a new class (WeightedRoundRobinPriorityQueue or
>>>>> >> whatever).  If we find that it's behaving better I'm not sure how much
>>>>> >> value we get from a tunable.  Note that there is one other user
>>>>> >> (msgr/simple/DispatchQueue) that we might also was to switch over at some
>>>>> >> point.. especially if this implementation is faster.
>>>>> >>
>>>>> >> Once it's cleaned up (remove commented out code, new class) put it up as a
>>>>> >> PR and we can review and get it through testing.
>>>>> >
>>>>> > In talking with Samuel in IRC, we think creating an abstract class for
>>>>> > the queue is the best option. C++11 allows you to still optimize
>>>>> > abstract template classes if you use final in the derived class (I
>>>>> > verified the assembly). I'm planning to refactor the code so that
>>>>> > similar code can be reused between queues and allows more flexibility
>>>>> > in the future (components can chose the queue that works the best for
>>>>> > them, etc). The test for which queue to use should be a very simple
>>>>> > comparison and it would allow us to let it bake much longer. I hope to
>>>>> > have a PR mid next week.
>>>>> >
>>>>> > - ----------------
>>>>> > Robert LeBlanc
>>>>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>> >
>>>>> > -----BEGIN PGP SIGNATURE-----
>>>>> > Version: Mailvelope v1.2.3
>>>>> > Comment: https://www.mailvelope.com
>>>>> >
>>>>> > wsFcBAEBCAAQBQJWPN1xCRDmVDuy+mK58QAA2XwP/1bv4DUVTfoAGU8q6RDK
>>>>> > xXCcqNoy2rFcG/D4wipnnGrjMYnVlH33l73hyaZiSQzMwvfzBAl5igQbIlAh
>>>>> > 41yqXOaGxk+BYRXRNHL5KCP0p0esjV8Wv1z9X2yfKdWeHbwueOKju5ljDQ6X
>>>>> > AaVXefw1fdag8JEvSjh0dsjgh8wf3G+lAcC9GHB/PFNHXYsl1BVOUz1REnno
>>>>> > v5vIAZz+iySb8vVrWXJUBaPdW9aao/sqJFU2ZHBziWgeIZ9OlrTlhr9znsxy
>>>>> > aDa18suMC8vhcrZjyAgKlSbxhgynWh7R2RjxFA5ZObBEsdbztJfg9ibyDzKG
>>>>> > Ngpe+jVXGTM03z4ohajzPPJ0tzj03XpGc45yXzj6Q4NHOlp5CPdzAPgmxQkz
>>>>> > ot5cAIR83z67PBIkemeiBQvbC4/ToVCXIBCfEPVW5Yu6grnTd4+AAKxTakip
>>>>> > +tXSai03MNMlNBeaBnooZ/li7s9VMSluXheZ2JNs9ssRTZkGQH3Pof3p3Y5t
>>>>> > pAb7qeRlxm+t+i1rZ1tn1FtF/YAx4DKGvyFz4Pzk8pe77jZ+nQLMtoOJJgGJ
>>>>> > w/+TGiegnUPt6pqWf/Z5o6+GB8SiM/5zKr+Xkm8aIcju/Fq0qy3fx96z81Cv
>>>>> > QC25ZklTblVt1ImSG30qoVcZdqWKTMwnJhpFNj8GVbzyV5EoFh4T0YBmu3fm
>>>>> > FKe/
>>>>> > =yodk
>>>>> > -----END PGP SIGNATURE-----
>>>>>
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: Mailvelope v1.2.3
>>>>> Comment: https://www.mailvelope.com
>>>>>
>>>>> wsFcBAEBCAAQBQJWPVZPCRDmVDuy+mK58QAAyK4QAL4ZdF0bRxSVSQAZGgDN
>>>>> pEfGEO1+heaj5Uj1sUitoXct5f//TbXcnuJDStlMe0rbplZDPUU0ZsXs8hNE
>>>>> sro6GiFuSP6ZQgHshW50d8iCGjmF/DKhYPs6jWJUIwCMelY45YLfpadAmkZT
>>>>> GePGEu5UzhYhlfQeiaQOFd7jWH2uVOnPLASK6f68cNRUv8rywJ8q5/6h0p8I
>>>>> TPg277NglGP1VntZ0z4/9CsSl49YOowVQooRZ9JQr3BpFYsbSEBBY5vLak8q
>>>>> X9Rb0rngG52vKT5VE58wUY/Pfbdwn7nbnV/BOUBnhBr+f14QKhNsWKpVM9EV
>>>>> R/cjlqJV3vesrwrXWay+4AaVoOn1TPMgBc/YV9LOlSdectNC0Ig7iBqC0Mjo
>>>>> kgeSQ0NJZSN99o4GKUnfwnd/fjDLzyi03XX5JkUMmEDLKPjT0LTmcnVSP5gu
>>>>> GGdEDNNEfIyt8PZalB4HN1Ik0c4/YdQKpb6XjbejoN37NvYom+dwZsKk2g/J
>>>>> Qa1bFDzvUZoTfax1yyMh2xu4b0rI6+a3bBhVBbY6Wz417aPRAhz09DecJoxt
>>>>> 28jqn3Aj7ARETg5BTCn1gGjEWP4IytLKOvctukCFSnxJWKPumTMRqfTUnsKu
>>>>> FxNjhSk5Kc+kVV7wQ7cU6NzxoBYHXMoEeamFXBmLooUG4lDKEeg0t+R9hPbT
>>>>> ABCA
>>>>> =yXJO
>>>>> -----END PGP SIGNATURE-----
>>>>>
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-09 19:19                       ` Robert LeBlanc
@ 2015-11-09 19:47                         ` Samuel Just
  2015-11-09 20:31                           ` Robert LeBlanc
  0 siblings, 1 reply; 24+ messages in thread
From: Samuel Just @ 2015-11-09 19:47 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: Sage Weil, ceph-devel

What I really want from PrioritizedQueue (and from the dmclock/mclock
approaches that are also being worked on) is a solution to the problem
of efficiently deciding which op to do next taking into account
fairness across io classes and ops with different costs.

On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Thanks, I think some of the fog is clearing. I was wondering how
> operations between threads were keeping the order of operations in
> PGs, that explains it.
>
> My original thoughts were to have a queue in front and behind the
> Prio/WRR queue. Threads scheduling work would queue to the pre-queue.
> The queue thread would pull ops off that queue and place them into the
> specialized queue, do house keeping, etc and would dequeue ops in that
> queue to a post-queue that worker threads would monitor. The thread
> queue could keep a certain amount of items in the post-queue to
> prevent starvation and worker threads from being blocked.

I'm not sure what the advantage of this would be -- it adds another thread
to the processing pipeline at best.

>
> It would require the worker thread to be able to handle any kind of
> op, or having separate post-queues for the different kinds of work.
> I'm getting the feeling that this may be a far too simplistic approach
> to the problem (or at least in terms of the organization of Ceph at
> this point). I'm also starting to feel that I'm getting out of my
> league trying to understand all the intricacies of the OSD work flow
> (trying to start with one of the most complicated parts of the system
> doesn't help).
>
> Maybe what I should do is just code up the queue to drop in as a
> replacement for the Prio queue for the moment. Then as your async work
> is completing we can shake out the potential issues with recovery and
> costs that we talked about earlier. One thing that I'd like to look
> into is elevating the priority of recovery ops that have client OPs
> blocked. I don't think the WRR queue gives the recovery thread a lot
> of time to get its work done.
>

If an op comes in that requires recovery to happen before it can be
processed, we send the recovery messages with client priority rather
than recovery priority.

> Based on some testing on Friday, the number of recovery ops on an osd
> did not really change if there were 20 backfilling or 1 backfilling.
> The difference came in with how many client I/Os were blocked waiting
> for objects to recover. When 20 backfills were going, there were a lot
> more blocked I/O waiting for objects to show up or recover. With one
> backfill, there were far less blocked I/O, but there were still times
> I/O would block.

The number of recovery ops is actually a separate configurable
(osd_recovery_max_active -- default to 15).  It's odd that with more
backfilling on a single osd, there is more blocked IO.  Looking into
that would be helpful and would probably give you some insight
into recovery and the op processing pipeline.
-Sam

> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWQPHBCRDmVDuy+mK58QAA72EQAMgzgrw3OAvBi1/NmuWl
> LXGM0qGz3hE/p5oUsnqcnz2/+VYP3FZRanszyuU8+vKCwj+I/Ny9Olm1JAnw
> DSE7PvhuO6J5w0ymOIccKdX7uk2QZyP8ggO1D5fLC2M9/xqQQSZrAPE7vc4j
> O9HHuZsMF+ABUKU5RVCjn1ax+y2LhpetxH3nu37xpSKPDPFiowVnW8YlBGJy
> Cf1FYMVDLv60F5EmjstOn4FhSXC/+DuSATwP+CmNEPZ3JNTBgtPuU/22/De3
> M4ZdDzeylVWYB66vbL9ijLeZDoCaxKgFL+QwUAswefaDBD1citCU2v7/7VQP
> aChnSzI8BYG0bHg5u7QEohzQyJUCC1OubiRkbUmOOeCiBI0Lqv3jf321T4ss
> PD3hqkagyhRe67zPB6bhhik0ZDOYHTAyV/ceAae4VDJTgu+/gI8Gc1c3mp5g
> nZL5z7hVohZ0AvfdEzasRhTnTcH6TfO9lpqU2nyMAc76SoPyDSTmAcMVt0tj
> /1BQAnk/I5rlCL5CKTxb2LR1/5WJt0eh7xtyKU1B0yh4G7JlMf/3kmrznOWu
> VEUUA3mJ1depDToadnECnCZMKHrGYC36XCy8xq3FDqhvl4BWV0VMA+yi1uhj
> zZ5udKKbN5Cxo/Sc48DG8wz9lQKn4LPCH2PD81oTcTfyd1iG2oNNkchrXa6K
> iwed
> =WjDS
> -----END PGP SIGNATURE-----
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Mon, Nov 9, 2015 at 11:19 AM, Samuel Just <sjust@redhat.com> wrote:
>> Ops are hashed from the messenger (or any of the other enqueue sources
>> for non-message items) into one of N queues, each of which is serviced
>> by M threads.  We can't quite have a single thread own a single queue
>> yet because the current design allows multiple threads/queue
>> (important because if a sync read blocks on one thread, other threads
>> working on that queue can continue to make progress).  However, the
>> queue contents are hashed to a queue based on the PG, so if a PG
>> queues work, it'll be on the same queue as it is already operating
>> from (which I think is what you are getting at?).  I'm moving away
>> from that with the async read work I'm doing (ceph-devel subject
>> "Async reads, sync writes, op thread model discussion"), but I'll
>> still need a replacement for PrioritizedQueue.
>> -Sam
>>
>> On Mon, Nov 9, 2015 at 9:19 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> I should probably work against this branch.
>>>
>>> I've got some more reading of code to do, but I'm thinking that there
>>> isn't one of these queues for each OSD, it seems like there is one
>>> queue for each thread in the OSD. If this is true, I think it makes
>>> sense to break the queue into it's own thread and have each 'worker'
>>> thread push and pop OPs out of that thread. I have been focused on the
>>> Queue code that I haven't really looked at the OSD/PG code until last
>>> Friday and it is like trying to drink from a fire hose going through
>>> that code, so I may be misunderstanding something.
>>>
>>> I'd appreciate any pointers to quickly understanding the OSD/PG code
>>> specifically around the OPs and the queue.
>>>
>>> Thanks,
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: Mailvelope v1.2.3
>>> Comment: https://www.mailvelope.com
>>>
>>> wsFcBAEBCAAQBQJWQNWzCRDmVDuy+mK58QAAAGAQAJ44uFZNl84eGrHIMzDc
>>> EyMBCE/STAOtZINV0DRmnKqrKLeWZ2ajHhr7gYdXByMdCi9QTnz/pYH8fP4m
>>> sTtf8MnaEdDuFYpc+kVP4sOZx+efF64s4isN8lDpoa6noqDR68W3xJ7MV9/l
>>> WJizoD9LWOvPVdPlO6M1jw3waL1eZMrxzPGpz2Xws4XnyGjIWeoUWl0kZYyT
>>> EwGNGaQXBsioowd2PySc3axAY/zaeaJFPp4trw2k2sE9Yi4NT39R3tWgljkC
>>> Ras8TjfHml1+xPeVadB4fdbYl2TaR8xYsVWCp+k1IuiEk/CAeljMjfST/Dqf
>>> TBMhhw8h24AP1GLPwiOFdGIh6h6gj0UoXeXsfHKhSuW6M8Ur+9fuynyuhBUV
>>> V0707nVmu9eiBwkgDHBcIRlnMQ0dDH60Uubf6ShagwjQSg6yfh6MNHVt6FFv
>>> PJCcGDfEqzCjbcGhRyG0bE4aAXXAlHnUy4y2VRGIodmTHqUcZAfXoQd3dklC
>>> KdSNyY+z/inOZip1Pbal4jNv3jAJBABn6Y1nNuB3W+33s/Jvt/aQbJpwYlkQ
>>> iivTMkoMsimVNKAhoTybZpVwJ2Hy5TL/tWqDNwg3TBXtWSFU5S1XgJzoAQm5
>>> yE7dbMwhAObw3XQ/eGMTmyICs1vwD0+mxaNHHWzSubtFKcdblUDW6BUxc+lj
>>> ztfA
>>> =GSDL
>>> -----END PGP SIGNATURE-----
>>> ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Mon, Nov 9, 2015 at 9:49 AM, Samuel Just <sjust@redhat.com> wrote:
>>>> It's partially in the unified queue.  The primary's background work
>>>> for kicking off a recovery operation is not in the unified queue, but
>>>> the messages to the replicas (pushes, pull, backfill scans) as well as
>>>> their replies are in the unified queue as normal messages.  I've got a
>>>> branch moving the primary's work to the queue as well (didn't quite
>>>> make infernalis) --
>>>> https://github.com/athanatos/ceph/tree/wip-recovery-wq.  I'm trying to
>>>> stabilize it now for merge that infernalis is out.
>>>> -Sam
>>>>
>>>> On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil <sage@newdream.net> wrote:
>>>>> On Fri, 6 Nov 2015, Robert LeBlanc wrote:
>>>>>
>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> Hash: SHA256
>>>>>>
>>>>>> After trying to look through the recovery code, I'm getting the
>>>>>> feeling that recovery OPs are not scheduled in the OP queue that I've
>>>>>> been working on. Does that sound right? In the OSD logs I'm only
>>>>>> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
>>>>>> If the recovery is in another separate queue, then there is no
>>>>>> reliable way to prioritize OPs between them.
>>>>>>
>>>>>> If I'm going off in to the weeds, please help me get back on the trail.
>>>>>
>>>>> Yeah, the recovery work isn't in the unified queue yet.
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> - ----------------
>>>>>> Robert LeBlanc
>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
>>>>>> > -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> > Hash: SHA256
>>>>>> >
>>>>>> > On Fri, Nov 6, 2015 at 3:12 AM, Sage Weil  wrote:
>>>>>> >> On Thu, 5 Nov 2015, Robert LeBlanc wrote:
>>>>>> >>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>> >>> Hash: SHA256
>>>>>> >>>
>>>>>> >>> Thanks Gregory,
>>>>>> >>>
>>>>>> >>> People are most likely busy and haven't had time to digest this and I
>>>>>> >>> may be expecting more excitement from it (I'm excited due to the
>>>>>> >>> results and probably also that such a large change still works). I'll
>>>>>> >>> keep working towards a PR, this was mostly proof of concept, now that
>>>>>> >>> there is some data I'll clean up the code.
>>>>>> >>
>>>>>> >> I'm *very* excited about this.  This is something that almost every
>>>>>> >> operator has problems with so it's very encouraging to see that switching
>>>>>> >> up the queue has a big impact in your environment.
>>>>>> >>
>>>>>> >> I'm just following up on this after a week of travel, so apologies if this
>>>>>> >> is covered already, but did you compare this implementation to the
>>>>>> >> original one with the same tunables?  I see somewhere that you had
>>>>>> >> max_backfills=20 at some point, which is going to be bad regardless of the
>>>>>> >> queue.
>>>>>> >>
>>>>>> >> I also see that you chnaged the strict priority threshold from LOW to HIGH
>>>>>> >> in OSD.cc; I'm curious how much of an impact was from this vs the queue
>>>>>> >> implementation.
>>>>>> >
>>>>>> > Yes max_backfills=20 is problematic for both queues and from what I
>>>>>> > can tell is because the OPs are waiting for PGs to get healthy. In a
>>>>>> > busy cluster it can take a while due to the recovery ops having low
>>>>>> > priority. In the current queue, it is possible to be blocked for a
>>>>>> > long time. The new queue seems to prevent that, but they do still back
>>>>>> > up. After this, I think I'd like to look into promoting recovery OPs
>>>>>> > that are blocking client OPs to higher priorities so that client I/O
>>>>>> > doesn't suffer as much during recovery. I think that will be a very
>>>>>> > different problem to tackle because I don't think I can do the proper
>>>>>> > introspection at the queue level. I'll have to do that logic in OSD.cc
>>>>>> > or PG.cc.
>>>>>> >
>>>>>> > The strict priority threshold didn't make much of a difference with
>>>>>> > the original queue. I initially eliminated it all together in the WRR,
>>>>>> > but there were times that peering would never complete. I want to get
>>>>>> > as many OPs in the WRR queue to provide fairness as much as possible.
>>>>>> > I haven't tweaked the setting much in the WRR queue yet.
>>>>>> >
>>>>>> >>
>>>>>> >>> I was thinking that a config option to choose the scheduler would be a
>>>>>> >>> good idea. In terms of the project what is the better approach: create
>>>>>> >>> a new template and each place the template class is instantiated
>>>>>> >>> select the queue, or perform the queue selection in the same template
>>>>>> >>> class, or something else I haven't thought of.
>>>>>> >>
>>>>>> >> A config option would be nice, but I'd start by just cleaning up the code
>>>>>> >> and putting it in a new class (WeightedRoundRobinPriorityQueue or
>>>>>> >> whatever).  If we find that it's behaving better I'm not sure how much
>>>>>> >> value we get from a tunable.  Note that there is one other user
>>>>>> >> (msgr/simple/DispatchQueue) that we might also was to switch over at some
>>>>>> >> point.. especially if this implementation is faster.
>>>>>> >>
>>>>>> >> Once it's cleaned up (remove commented out code, new class) put it up as a
>>>>>> >> PR and we can review and get it through testing.
>>>>>> >
>>>>>> > In talking with Samuel in IRC, we think creating an abstract class for
>>>>>> > the queue is the best option. C++11 allows you to still optimize
>>>>>> > abstract template classes if you use final in the derived class (I
>>>>>> > verified the assembly). I'm planning to refactor the code so that
>>>>>> > similar code can be reused between queues and allows more flexibility
>>>>>> > in the future (components can chose the queue that works the best for
>>>>>> > them, etc). The test for which queue to use should be a very simple
>>>>>> > comparison and it would allow us to let it bake much longer. I hope to
>>>>>> > have a PR mid next week.
>>>>>> >
>>>>>> > - ----------------
>>>>>> > Robert LeBlanc
>>>>>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>> >
>>>>>> > -----BEGIN PGP SIGNATURE-----
>>>>>> > Version: Mailvelope v1.2.3
>>>>>> > Comment: https://www.mailvelope.com
>>>>>> >
>>>>>> > wsFcBAEBCAAQBQJWPN1xCRDmVDuy+mK58QAA2XwP/1bv4DUVTfoAGU8q6RDK
>>>>>> > xXCcqNoy2rFcG/D4wipnnGrjMYnVlH33l73hyaZiSQzMwvfzBAl5igQbIlAh
>>>>>> > 41yqXOaGxk+BYRXRNHL5KCP0p0esjV8Wv1z9X2yfKdWeHbwueOKju5ljDQ6X
>>>>>> > AaVXefw1fdag8JEvSjh0dsjgh8wf3G+lAcC9GHB/PFNHXYsl1BVOUz1REnno
>>>>>> > v5vIAZz+iySb8vVrWXJUBaPdW9aao/sqJFU2ZHBziWgeIZ9OlrTlhr9znsxy
>>>>>> > aDa18suMC8vhcrZjyAgKlSbxhgynWh7R2RjxFA5ZObBEsdbztJfg9ibyDzKG
>>>>>> > Ngpe+jVXGTM03z4ohajzPPJ0tzj03XpGc45yXzj6Q4NHOlp5CPdzAPgmxQkz
>>>>>> > ot5cAIR83z67PBIkemeiBQvbC4/ToVCXIBCfEPVW5Yu6grnTd4+AAKxTakip
>>>>>> > +tXSai03MNMlNBeaBnooZ/li7s9VMSluXheZ2JNs9ssRTZkGQH3Pof3p3Y5t
>>>>>> > pAb7qeRlxm+t+i1rZ1tn1FtF/YAx4DKGvyFz4Pzk8pe77jZ+nQLMtoOJJgGJ
>>>>>> > w/+TGiegnUPt6pqWf/Z5o6+GB8SiM/5zKr+Xkm8aIcju/Fq0qy3fx96z81Cv
>>>>>> > QC25ZklTblVt1ImSG30qoVcZdqWKTMwnJhpFNj8GVbzyV5EoFh4T0YBmu3fm
>>>>>> > FKe/
>>>>>> > =yodk
>>>>>> > -----END PGP SIGNATURE-----
>>>>>>
>>>>>> -----BEGIN PGP SIGNATURE-----
>>>>>> Version: Mailvelope v1.2.3
>>>>>> Comment: https://www.mailvelope.com
>>>>>>
>>>>>> wsFcBAEBCAAQBQJWPVZPCRDmVDuy+mK58QAAyK4QAL4ZdF0bRxSVSQAZGgDN
>>>>>> pEfGEO1+heaj5Uj1sUitoXct5f//TbXcnuJDStlMe0rbplZDPUU0ZsXs8hNE
>>>>>> sro6GiFuSP6ZQgHshW50d8iCGjmF/DKhYPs6jWJUIwCMelY45YLfpadAmkZT
>>>>>> GePGEu5UzhYhlfQeiaQOFd7jWH2uVOnPLASK6f68cNRUv8rywJ8q5/6h0p8I
>>>>>> TPg277NglGP1VntZ0z4/9CsSl49YOowVQooRZ9JQr3BpFYsbSEBBY5vLak8q
>>>>>> X9Rb0rngG52vKT5VE58wUY/Pfbdwn7nbnV/BOUBnhBr+f14QKhNsWKpVM9EV
>>>>>> R/cjlqJV3vesrwrXWay+4AaVoOn1TPMgBc/YV9LOlSdectNC0Ig7iBqC0Mjo
>>>>>> kgeSQ0NJZSN99o4GKUnfwnd/fjDLzyi03XX5JkUMmEDLKPjT0LTmcnVSP5gu
>>>>>> GGdEDNNEfIyt8PZalB4HN1Ik0c4/YdQKpb6XjbejoN37NvYom+dwZsKk2g/J
>>>>>> Qa1bFDzvUZoTfax1yyMh2xu4b0rI6+a3bBhVBbY6Wz417aPRAhz09DecJoxt
>>>>>> 28jqn3Aj7ARETg5BTCn1gGjEWP4IytLKOvctukCFSnxJWKPumTMRqfTUnsKu
>>>>>> FxNjhSk5Kc+kVV7wQ7cU6NzxoBYHXMoEeamFXBmLooUG4lDKEeg0t+R9hPbT
>>>>>> ABCA
>>>>>> =yXJO
>>>>>> -----END PGP SIGNATURE-----
>>>>>>
>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-09 19:47                         ` Samuel Just
@ 2015-11-09 20:31                           ` Robert LeBlanc
  2015-11-09 20:49                             ` Samuel Just
  0 siblings, 1 reply; 24+ messages in thread
From: Robert LeBlanc @ 2015-11-09 20:31 UTC (permalink / raw)
  To: Samuel Just; +Cc: Sage Weil, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On Mon, Nov 9, 2015 at 12:47 PM, Samuel Just  wrote:
> What I really want from PrioritizedQueue (and from the dmclock/mclock
> approaches that are also being worked on) is a solution to the problem
> of efficiently deciding which op to do next taking into account
> fairness across io classes and ops with different costs.

> On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc  wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> Thanks, I think some of the fog is clearing. I was wondering how
>> operations between threads were keeping the order of operations in
>> PGs, that explains it.
>>
>> My original thoughts were to have a queue in front and behind the
>> Prio/WRR queue. Threads scheduling work would queue to the pre-queue.
>> The queue thread would pull ops off that queue and place them into the
>> specialized queue, do house keeping, etc and would dequeue ops in that
>> queue to a post-queue that worker threads would monitor. The thread
>> queue could keep a certain amount of items in the post-queue to
>> prevent starvation and worker threads from being blocked.
>
> I'm not sure what the advantage of this would be -- it adds another thread
> to the processing pipeline at best.

There are a few reasons I thought about it. 1. It is hard to
prioritize/mange the work load if you can't see/manage all the
operations. One queue allows the algorithm to make decisions based on
all available information. (This point seems to be handled in a
different way in the future) 2. Reduce latency in the Op path. When an
OP is queued, there is overhead in getting it in the right place. When
an OP is dequeued there is more overhead in spreading tokens, etc.
Right now that is all serial, if an OP is stuck in the queue waiting
to be dispatched some of this overhead can't be performed while in
this waiting period. The idea is pushing that overhead to a separate
thread and allowing a worker thread to queue/dequeue in the most
efficient manner. It also allows for more complex trending,
scheduling, etc because it can sit outside of the OP path. As the
workload changes, it can dynamically change how it manages the queue
like simple fifo for low periods where latency is dominated by compute
time, to Token/WRR when latency is dominated by disk access, etc.

>> It would require the worker thread to be able to handle any kind of
>> op, or having separate post-queues for the different kinds of work.
>> I'm getting the feeling that this may be a far too simplistic approach
>> to the problem (or at least in terms of the organization of Ceph at
>> this point). I'm also starting to feel that I'm getting out of my
>> league trying to understand all the intricacies of the OSD work flow
>> (trying to start with one of the most complicated parts of the system
>> doesn't help).
>>
>> Maybe what I should do is just code up the queue to drop in as a
>> replacement for the Prio queue for the moment. Then as your async work
>> is completing we can shake out the potential issues with recovery and
>> costs that we talked about earlier. One thing that I'd like to look
>> into is elevating the priority of recovery ops that have client OPs
>> blocked. I don't think the WRR queue gives the recovery thread a lot
>> of time to get its work done.
>>
>
> If an op comes in that requires recovery to happen before it can be
> processed, we send the recovery messages with client priority rather
> than recovery priority.

But the recovery is still happening the recovery thread and not the
client thread, right? The recovery thread has a lower priority than
the op thread? That's how I understand it.

>> Based on some testing on Friday, the number of recovery ops on an osd
>> did not really change if there were 20 backfilling or 1 backfilling.
>> The difference came in with how many client I/Os were blocked waiting
>> for objects to recover. When 20 backfills were going, there were a lot
>> more blocked I/O waiting for objects to show up or recover. With one
>> backfill, there were far less blocked I/O, but there were still times
>> I/O would block.
>
> The number of recovery ops is actually a separate configurable
> (osd_recovery_max_active -- default to 15).  It's odd that with more
> backfilling on a single osd, there is more blocked IO.  Looking into
> that would be helpful and would probably give you some insight
> into recovery and the op processing pipeline.

I'll see what I can find here.

- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWQQJ0CRDmVDuy+mK58QAAeeUP/1uN/9EdqQDJdxW7fgeJ
/E0X49LmnnCigMPL5QJ3fpGjf44C0xcc9LN5IGJwwumHd5ozznpocy8Oj30N
+rNPJQ4dxcRao+bXUL/+DCQuY0wN/i7CqfMTW5PFmkdH4K9Lgce+bN6Q5Ora
q8JZvAxaZLCLZ10N+uiD5ghs+3X68hu4Da8SYQj0vjLs5gV4oATebF3JuYXW
GZ9qNfm2ygbeuT5Q0fhOKrvwJ9taKagMNrZLU10Wz5lHpGNitP3f17sVQznF
7ZCkZ+2oS+P4Lerchc3xB2qBJUoPJGSuGAUTSl/uUeyMoZT1+2LvLdNbJaio
UonoKJv47p4mpjo75x6FTWbJg0Ix+8/3/6oo3CkxC+6vOeWcv90B3TJGJPRz
tAayNB/1YpsVZ3QlHiuyC7+TdKofLRlMR21iAnAJkZ6FdgMz9SFk1Rp4vuyR
1qeZ+B4qA0m9ZWjx/G80j3fkUDY48EHR5gnI1k+WHFAh8KqT3eTRr37n9HH4
7wVakfPv89+HRjqrlA7WK5F89UVp1I+2kEmtPADCiwgh2wf0zn7Y5tA4FMXH
DIloZIRfvPwFtwpqgF7GR5vb/1dEOzD9Da0Zb7gBfsEfGaI2pJ+yvD1ad3BB
eqHQ05rl7s8meeX0H+6gWn9/f0JA65k2P2Y4N3YHk6OvKqIqnhreS9Tl4grH
MrBN
=Ju+O
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-09 20:31                           ` Robert LeBlanc
@ 2015-11-09 20:49                             ` Samuel Just
  2015-11-09 21:30                               ` Robert LeBlanc
  2015-11-10  0:39                               ` Milosz Tanski
  0 siblings, 2 replies; 24+ messages in thread
From: Samuel Just @ 2015-11-09 20:49 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: Sage Weil, ceph-devel

On Mon, Nov 9, 2015 at 12:31 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> On Mon, Nov 9, 2015 at 12:47 PM, Samuel Just  wrote:
>> What I really want from PrioritizedQueue (and from the dmclock/mclock
>> approaches that are also being worked on) is a solution to the problem
>> of efficiently deciding which op to do next taking into account
>> fairness across io classes and ops with different costs.
>
>> On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc  wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> Thanks, I think some of the fog is clearing. I was wondering how
>>> operations between threads were keeping the order of operations in
>>> PGs, that explains it.
>>>
>>> My original thoughts were to have a queue in front and behind the
>>> Prio/WRR queue. Threads scheduling work would queue to the pre-queue.
>>> The queue thread would pull ops off that queue and place them into the
>>> specialized queue, do house keeping, etc and would dequeue ops in that
>>> queue to a post-queue that worker threads would monitor. The thread
>>> queue could keep a certain amount of items in the post-queue to
>>> prevent starvation and worker threads from being blocked.
>>
>> I'm not sure what the advantage of this would be -- it adds another thread
>> to the processing pipeline at best.
>
> There are a few reasons I thought about it. 1. It is hard to
> prioritize/mange the work load if you can't see/manage all the
> operations. One queue allows the algorithm to make decisions based on
> all available information. (This point seems to be handled in a
> different way in the future) 2. Reduce latency in the Op path. When an
> OP is queued, there is overhead in getting it in the right place. When
> an OP is dequeued there is more overhead in spreading tokens, etc.
> Right now that is all serial, if an OP is stuck in the queue waiting
> to be dispatched some of this overhead can't be performed while in
> this waiting period. The idea is pushing that overhead to a separate
> thread and allowing a worker thread to queue/dequeue in the most
> efficient manner. It also allows for more complex trending,
> scheduling, etc because it can sit outside of the OP path. As the
> workload changes, it can dynamically change how it manages the queue
> like simple fifo for low periods where latency is dominated by compute
> time, to Token/WRR when latency is dominated by disk access, etc.
>

We basically don't want a single thread to see all of the operations -- it
would cause a tremendous bottleneck and complicate the design
immensely.  It's shouldn't be necessary anyway since PGs are a form
of course grained locking, so it's probably fine to schedule work for
different groups of PGs independently if we assume that all kinds of
work are well distributed over those groups.

>>> It would require the worker thread to be able to handle any kind of
>>> op, or having separate post-queues for the different kinds of work.
>>> I'm getting the feeling that this may be a far too simplistic approach
>>> to the problem (or at least in terms of the organization of Ceph at
>>> this point). I'm also starting to feel that I'm getting out of my
>>> league trying to understand all the intricacies of the OSD work flow
>>> (trying to start with one of the most complicated parts of the system
>>> doesn't help).
>>>
>>> Maybe what I should do is just code up the queue to drop in as a
>>> replacement for the Prio queue for the moment. Then as your async work
>>> is completing we can shake out the potential issues with recovery and
>>> costs that we talked about earlier. One thing that I'd like to look
>>> into is elevating the priority of recovery ops that have client OPs
>>> blocked. I don't think the WRR queue gives the recovery thread a lot
>>> of time to get its work done.
>>>
>>
>> If an op comes in that requires recovery to happen before it can be
>> processed, we send the recovery messages with client priority rather
>> than recovery priority.
>
> But the recovery is still happening the recovery thread and not the
> client thread, right? The recovery thread has a lower priority than
> the op thread? That's how I understand it.
>

No, in hammer we removed the snap trim and scrub workqueues.  With
wip-recovery-wq, I remove the recovery wqs as well.  Ideally, the only
meaningful set of threads remaining will be the op_tp and associated
queues.

>>> Based on some testing on Friday, the number of recovery ops on an osd
>>> did not really change if there were 20 backfilling or 1 backfilling.
>>> The difference came in with how many client I/Os were blocked waiting
>>> for objects to recover. When 20 backfills were going, there were a lot
>>> more blocked I/O waiting for objects to show up or recover. With one
>>> backfill, there were far less blocked I/O, but there were still times
>>> I/O would block.
>>
>> The number of recovery ops is actually a separate configurable
>> (osd_recovery_max_active -- default to 15).  It's odd that with more
>> backfilling on a single osd, there is more blocked IO.  Looking into
>> that would be helpful and would probably give you some insight
>> into recovery and the op processing pipeline.
>
> I'll see what I can find here.
>
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWQQJ0CRDmVDuy+mK58QAAeeUP/1uN/9EdqQDJdxW7fgeJ
> /E0X49LmnnCigMPL5QJ3fpGjf44C0xcc9LN5IGJwwumHd5ozznpocy8Oj30N
> +rNPJQ4dxcRao+bXUL/+DCQuY0wN/i7CqfMTW5PFmkdH4K9Lgce+bN6Q5Ora
> q8JZvAxaZLCLZ10N+uiD5ghs+3X68hu4Da8SYQj0vjLs5gV4oATebF3JuYXW
> GZ9qNfm2ygbeuT5Q0fhOKrvwJ9taKagMNrZLU10Wz5lHpGNitP3f17sVQznF
> 7ZCkZ+2oS+P4Lerchc3xB2qBJUoPJGSuGAUTSl/uUeyMoZT1+2LvLdNbJaio
> UonoKJv47p4mpjo75x6FTWbJg0Ix+8/3/6oo3CkxC+6vOeWcv90B3TJGJPRz
> tAayNB/1YpsVZ3QlHiuyC7+TdKofLRlMR21iAnAJkZ6FdgMz9SFk1Rp4vuyR
> 1qeZ+B4qA0m9ZWjx/G80j3fkUDY48EHR5gnI1k+WHFAh8KqT3eTRr37n9HH4
> 7wVakfPv89+HRjqrlA7WK5F89UVp1I+2kEmtPADCiwgh2wf0zn7Y5tA4FMXH
> DIloZIRfvPwFtwpqgF7GR5vb/1dEOzD9Da0Zb7gBfsEfGaI2pJ+yvD1ad3BB
> eqHQ05rl7s8meeX0H+6gWn9/f0JA65k2P2Y4N3YHk6OvKqIqnhreS9Tl4grH
> MrBN
> =Ju+O
> -----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-09 20:49                             ` Samuel Just
@ 2015-11-09 21:30                               ` Robert LeBlanc
  2015-11-09 22:35                                 ` Samuel Just
  2015-11-10  0:39                               ` Milosz Tanski
  1 sibling, 1 reply; 24+ messages in thread
From: Robert LeBlanc @ 2015-11-09 21:30 UTC (permalink / raw)
  To: Samuel Just; +Cc: Sage Weil, ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On Mon, Nov 9, 2015 at 1:49 PM, Samuel Just  wrote:
> We basically don't want a single thread to see all of the operations -- it
> would cause a tremendous bottleneck and complicate the design
> immensely.  It's shouldn't be necessary anyway since PGs are a form
> of course grained locking, so it's probably fine to schedule work for
> different groups of PGs independently if we assume that all kinds of
> work are well distributed over those groups.

The only issue that I can see, based on the discussion last week, is
when the client I/O is small. There will be some points where each
thread will think it is OK so send a bolder along with the pebbles
(recovery I/O vs. client I/O), If all/most of the threads send a
bolder at the same time would it cause issues for slow disks
(spindles)? A single queue would be much more intelligent about
situations like this and spread the bolders out better. It also seems
more scalable as you add threads (I don't think really practical on
spindles). I assume the bottleneck in your concern is the thread
communication between threads? I'm trying to understand and in no way
trying to attack you (I've been know to come across differently than I
intend to).

>> But the recovery is still happening the recovery thread and not the
>> client thread, right? The recovery thread has a lower priority than
>> the op thread? That's how I understand it.
>>
>
> No, in hammer we removed the snap trim and scrub workqueues.  With
> wip-recovery-wq, I remove the recovery wqs as well.  Ideally, the only
> meaningful set of threads remaining will be the op_tp and associated
> queues.

OK, that is good news, I didn't do a scrub so I haven't seen the OPs
for that. Do you know the priorities of snap trim, scrub and recovery
so that I can do some math/logic on applying costs in an efficient way
as we talked about last week?

Thanks,

- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWQRB6CRDmVDuy+mK58QAAAsMP/RoBeyhqwNDURHagKJ9i
knjYW4jy0FFw1XmnFRhJN7FuFlYlHZ+bwvQGGYvmOkLlxgY9Y+J1GglwwV14
Vvtd/1LBOUw06Ch/WjhcgVFNIQdgdNBPHPaRurSTGxnofYKAwqB266gnzwAo
oX3EpgRskzrlwrOIg+b46Z3FhbdxYfJVqsWIEazIu9uFJDxf/pFimWSig0n1
bQsB0lZNeTbGKYww5GZqPtY3dVNqbfM6Xj5r5kxf5mhDZ2vKWJfvlc8nu86z
/VIDy5ZHPFZzv79wNlzNtZ9ofdmMT4n0Bhk8q4SFQSivs2z68DQxthcGXVaB
Bp5gy19QyE2mC6SeG3kwCYlEiGwJBGN5PVj9wDWrqDRiG/3eRS9yUs7N3RPW
hViKOYCt5lHBEhkkXaE824FweWZhupzXjiAjCMXYGtWek4LbLH9XFiMrigbR
b07EohO3cnXvrHL3+SmdEsHs0PIS0o9anyB7wn7Ze9oHQNYHXmzw48nzhth6
juGxCVeg80iNnlwpH/jQRfyEFB8rKfpJd7BLYdJgc/q4L25o/q588MeUqjUw
gc0cVkoKnegbz1fZ85CjI3YGXgXwRtVXFFl4Z+KdEJlEa1q9nRBGsho8LkT6
aanb77/QUJixLi7QQi8blXMvY0wjxzEkbtkoij0rL1OaxmKpoy/Nb8v6kyDL
rnL6
=IlY9
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-09 21:30                               ` Robert LeBlanc
@ 2015-11-09 22:35                                 ` Samuel Just
  2015-11-09 23:50                                   ` Robert LeBlanc
  0 siblings, 1 reply; 24+ messages in thread
From: Samuel Just @ 2015-11-09 22:35 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: Sage Weil, ceph-devel

On Mon, Nov 9, 2015 at 1:30 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> On Mon, Nov 9, 2015 at 1:49 PM, Samuel Just  wrote:
>> We basically don't want a single thread to see all of the operations -- it
>> would cause a tremendous bottleneck and complicate the design
>> immensely.  It's shouldn't be necessary anyway since PGs are a form
>> of course grained locking, so it's probably fine to schedule work for
>> different groups of PGs independently if we assume that all kinds of
>> work are well distributed over those groups.
>
> The only issue that I can see, based on the discussion last week, is
> when the client I/O is small. There will be some points where each
> thread will think it is OK so send a bolder along with the pebbles
> (recovery I/O vs. client I/O), If all/most of the threads send a
> bolder at the same time would it cause issues for slow disks
> (spindles)? A single queue would be much more intelligent about
> situations like this and spread the bolders out better. It also seems
> more scalable as you add threads (I don't think really practical on
> spindles). I assume the bottleneck in your concern is the thread
> communication between threads? I'm trying to understand and in no way
> trying to attack you (I've been know to come across differently than I
> intend to).
>

This is one of the advantages of the dmclock/mclock based designs,
we'd be able to portion out the available IO (expresed as cost/time)
among the threads and let each queue schedule against its own
quota.  A significant challenge there of course is estimating available
io capacity. Another piece is that there needs to be a bound on how
large boulders get.  Recovery will break up recovery of large objects
into lots of messages to avoid having too large a boulder.  Similarly,
there are limits at least on the bulk size of a client IO operation.

I don't understand how a single queue would be more scalable as we
add threads.  Pre-giant, that's how the queue worked, and it was
indeed a significant bottleneck.

As I see it, each operation is ordered in two ways (each requiring
a lock/thread of control/something):
1) The message stream from the client is ordered (represented by
the reader thread in the SimpleMessenger).  The ordering here
is actually part of the librados interface contract for the most part
(certain reads could theoretically be reordered here without
breaking the rules).
2) Operations on the PG are ordered necessarily by the PG lock
(client writes by necessity, most everything else by convenience).

So at a minimum, something ordered by 1 needs to pass off to
something ordered by 2.  We currently do this by allowing the
reader thread to fast-dispatch directly into the op queue responsible
for the PG which owns the op.  A thread local to the right PG then
takes it from there.  This means that two different ops each of which
is on a different client/pg combo may not interact at all and could be
handled entirely in parallel (that's the ideal, anyway).  Depending on
what you mean by "queue", putting all ops in a single queue
necessarily serializes all IO on that structure (even if only for a small
portion of the execution time).  This limits both parallelism and
the amount of computation you can actually do to make the
scheduling decision even more so than the current design does.

Ideally, we'd like to have our cake and eat it too: we'd like good
scheduling (which PrioritizedQueue does not particularly well)
while minimizing overhead of the queue itself (an even bigger
problem with PrioritizedQueue) and keeping scaling as linear
as we can get it on many-core machines (which usually means
that independent ops should have a low probability of touching
the same structures).

>>> But the recovery is still happening the recovery thread and not the
>>> client thread, right? The recovery thread has a lower priority than
>>> the op thread? That's how I understand it.
>>>
>>
>> No, in hammer we removed the snap trim and scrub workqueues.  With
>> wip-recovery-wq, I remove the recovery wqs as well.  Ideally, the only
>> meaningful set of threads remaining will be the op_tp and associated
>> queues.
>
> OK, that is good news, I didn't do a scrub so I haven't seen the OPs
> for that. Do you know the priorities of snap trim, scrub and recovery
> so that I can do some math/logic on applying costs in an efficient way
> as we talked about last week?
>

There are config options in common/config_opt.h iirc.
-Sam

> Thanks,
>
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWQRB6CRDmVDuy+mK58QAAAsMP/RoBeyhqwNDURHagKJ9i
> knjYW4jy0FFw1XmnFRhJN7FuFlYlHZ+bwvQGGYvmOkLlxgY9Y+J1GglwwV14
> Vvtd/1LBOUw06Ch/WjhcgVFNIQdgdNBPHPaRurSTGxnofYKAwqB266gnzwAo
> oX3EpgRskzrlwrOIg+b46Z3FhbdxYfJVqsWIEazIu9uFJDxf/pFimWSig0n1
> bQsB0lZNeTbGKYww5GZqPtY3dVNqbfM6Xj5r5kxf5mhDZ2vKWJfvlc8nu86z
> /VIDy5ZHPFZzv79wNlzNtZ9ofdmMT4n0Bhk8q4SFQSivs2z68DQxthcGXVaB
> Bp5gy19QyE2mC6SeG3kwCYlEiGwJBGN5PVj9wDWrqDRiG/3eRS9yUs7N3RPW
> hViKOYCt5lHBEhkkXaE824FweWZhupzXjiAjCMXYGtWek4LbLH9XFiMrigbR
> b07EohO3cnXvrHL3+SmdEsHs0PIS0o9anyB7wn7Ze9oHQNYHXmzw48nzhth6
> juGxCVeg80iNnlwpH/jQRfyEFB8rKfpJd7BLYdJgc/q4L25o/q588MeUqjUw
> gc0cVkoKnegbz1fZ85CjI3YGXgXwRtVXFFl4Z+KdEJlEa1q9nRBGsho8LkT6
> aanb77/QUJixLi7QQi8blXMvY0wjxzEkbtkoij0rL1OaxmKpoy/Nb8v6kyDL
> rnL6
> =IlY9
> -----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-09 22:35                                 ` Samuel Just
@ 2015-11-09 23:50                                   ` Robert LeBlanc
  0 siblings, 0 replies; 24+ messages in thread
From: Robert LeBlanc @ 2015-11-09 23:50 UTC (permalink / raw)
  To: ceph-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

It sounds like dmclock/mclock will alleviate a lot of the concerns I
have as long as it can be smart like you said. It sounds like the
queue thread was already tried so there is experience behind the
current implementation vs. me thinking it might be better. The basic
idea I had is below:

Client,           Client,
Repop,            Repop,
backfill,   ...   Backfill,
recovery,         Recovery,
etc thread        etc thread
    |                 |
    \                 /
     \               /
lock; push (prio,cost,strict,
front/back,subsystem,&OP); unlock
            |
            |
     (queue thread) pop
        /          \
       /            \
if ops.low        Place op in prio
(fast path)       queue, do any
      |           housekeeping
      |                |
      |           when post-queue.len
      |           < threads
      \                /
       \              /
       post-queue push
              |
       lock, cond, pop
           /    \
          /      \
    Worker   ...  Worker
    thread        thread

What I meant by more scalable is that the rate of boulders would be
constant and evenly dispersed. It also prevents any one worker thread
from being backed up while others are idle. This may not be an issue
if the PG is busy. This design could also suffer if many OPs require
some locking at the PG level instead of the object level. The queue
itself does not do any op work only passing pointers to work to be
done. As I mentioned before it sounds like something like this already
proved to be limiting in performance, although thinking through this
has given me some ideas about implementing a fast path option in the
WRR queue to save some cycles.

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWQS/lCRDmVDuy+mK58QAA69QP/0H1K3cArNaqM+yo4W4D
vpUMGxgTOg/8+69w4U2smHtjy8zRnJyUU1fbYdeTCbwTlZi5XVvtdMstDgPf
OqtF+uJm/akWVblzjreWjcqkBOXmlv89loOKJZGp9oUaHll8vrL117dd7Kwh
WHnGkc+fKCjkA7qo3gBo+Y5N3I1N2BNF0NQVuSTFEP5CfPE4Wy6DwBpYD1KY
zoN021E564V8eK1336je+v5xDg4oZLOxp5HhWmLHXnnisvfrK/VUipVl3aGY
Y5AXpdHGuRlsfvodKo6ZjAr1NEyPqlapJ7o57montY8yTxPR6ubSYAPP04Ky
VxA1FmtjsXKwui23rJMViWmY+lCT/P42fDlXEmVkbrnpkfoyzWn3N6yERatV
UCazWH6eA8w/FMjrkU7FTNjttYeQU74Ph26qywL9oNVWbzKKaiEaWgGzOT1Y
c65babw+qExK1syF8cWlKaf+roWIHeDq2+9iNO5SJ5v2eZ+JZipwW5f0BibM
EQGCx4b+vcjJgN2rYxUYOsm0tyOj+MMi2MrHqLC5Ns4zwqBw29+Gz4x+RfW5
2mw/0zaBe9v5GG7SocCHSuLexYBXjJ5h7zx2lII38Bnz9M6OfaAzuFtSXAqH
VSs4+6BrksnvAdhJNh4eX21mF/zIrnatxIvzvZlkAkSlEzpB72ZU8fC1OH/X
3hWW
=LuyT
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Request for Comments: Weighted Round Robin OP Queue
  2015-11-09 20:49                             ` Samuel Just
  2015-11-09 21:30                               ` Robert LeBlanc
@ 2015-11-10  0:39                               ` Milosz Tanski
  1 sibling, 0 replies; 24+ messages in thread
From: Milosz Tanski @ 2015-11-10  0:39 UTC (permalink / raw)
  To: Samuel Just; +Cc: Robert LeBlanc, Sage Weil, ceph-devel

On Mon, Nov 9, 2015 at 3:49 PM, Samuel Just <sjust@redhat.com> wrote:
> On Mon, Nov 9, 2015 at 12:31 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> On Mon, Nov 9, 2015 at 12:47 PM, Samuel Just  wrote:
>>> What I really want from PrioritizedQueue (and from the dmclock/mclock
>>> approaches that are also being worked on) is a solution to the problem
>>> of efficiently deciding which op to do next taking into account
>>> fairness across io classes and ops with different costs.
>>
>>> On Mon, Nov 9, 2015 at 11:19 AM, Robert LeBlanc  wrote:
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA256
>>>>
>>>> Thanks, I think some of the fog is clearing. I was wondering how
>>>> operations between threads were keeping the order of operations in
>>>> PGs, that explains it.
>>>>
>>>> My original thoughts were to have a queue in front and behind the
>>>> Prio/WRR queue. Threads scheduling work would queue to the pre-queue.
>>>> The queue thread would pull ops off that queue and place them into the
>>>> specialized queue, do house keeping, etc and would dequeue ops in that
>>>> queue to a post-queue that worker threads would monitor. The thread
>>>> queue could keep a certain amount of items in the post-queue to
>>>> prevent starvation and worker threads from being blocked.
>>>
>>> I'm not sure what the advantage of this would be -- it adds another thread
>>> to the processing pipeline at best.
>>
>> There are a few reasons I thought about it. 1. It is hard to
>> prioritize/mange the work load if you can't see/manage all the
>> operations. One queue allows the algorithm to make decisions based on
>> all available information. (This point seems to be handled in a
>> different way in the future) 2. Reduce latency in the Op path. When an
>> OP is queued, there is overhead in getting it in the right place. When
>> an OP is dequeued there is more overhead in spreading tokens, etc.
>> Right now that is all serial, if an OP is stuck in the queue waiting
>> to be dispatched some of this overhead can't be performed while in
>> this waiting period. The idea is pushing that overhead to a separate
>> thread and allowing a worker thread to queue/dequeue in the most
>> efficient manner. It also allows for more complex trending,
>> scheduling, etc because it can sit outside of the OP path. As the
>> workload changes, it can dynamically change how it manages the queue
>> like simple fifo for low periods where latency is dominated by compute
>> time, to Token/WRR when latency is dominated by disk access, etc.
>>
>
> We basically don't want a single thread to see all of the operations -- it
> would cause a tremendous bottleneck and complicate the design
> immensely.  It's shouldn't be necessary anyway since PGs are a form
> of course grained locking, so it's probably fine to schedule work for
> different groups of PGs independently if we assume that all kinds of
> work are well distributed over those groups.

There are are some queue implementations that rely on a single thread
essentially playing traffic cop in between queues and it's pretty
fast. FastFlow, the C++ lib, does that. It constructs other kinds of
queues from fast lock-free / wait-free SPSC queues. In the case of
something like MPMC there's a mediator thread there that manages N
SPSC in-queus to MSPC out-queues.

I'm only bringing this up since if you have a problem that might need
a mediator to arrange order, it's possible to do it fast.

>
>>>> It would require the worker thread to be able to handle any kind of
>>>> op, or having separate post-queues for the different kinds of work.
>>>> I'm getting the feeling that this may be a far too simplistic approach
>>>> to the problem (or at least in terms of the organization of Ceph at
>>>> this point). I'm also starting to feel that I'm getting out of my
>>>> league trying to understand all the intricacies of the OSD work flow
>>>> (trying to start with one of the most complicated parts of the system
>>>> doesn't help).
>>>>
>>>> Maybe what I should do is just code up the queue to drop in as a
>>>> replacement for the Prio queue for the moment. Then as your async work
>>>> is completing we can shake out the potential issues with recovery and
>>>> costs that we talked about earlier. One thing that I'd like to look
>>>> into is elevating the priority of recovery ops that have client OPs
>>>> blocked. I don't think the WRR queue gives the recovery thread a lot
>>>> of time to get its work done.
>>>>
>>>
>>> If an op comes in that requires recovery to happen before it can be
>>> processed, we send the recovery messages with client priority rather
>>> than recovery priority.
>>
>> But the recovery is still happening the recovery thread and not the
>> client thread, right? The recovery thread has a lower priority than
>> the op thread? That's how I understand it.
>>
>
> No, in hammer we removed the snap trim and scrub workqueues.  With
> wip-recovery-wq, I remove the recovery wqs as well.  Ideally, the only
> meaningful set of threads remaining will be the op_tp and associated
> queues.
>
>>>> Based on some testing on Friday, the number of recovery ops on an osd
>>>> did not really change if there were 20 backfilling or 1 backfilling.
>>>> The difference came in with how many client I/Os were blocked waiting
>>>> for objects to recover. When 20 backfills were going, there were a lot
>>>> more blocked I/O waiting for objects to show up or recover. With one
>>>> backfill, there were far less blocked I/O, but there were still times
>>>> I/O would block.
>>>
>>> The number of recovery ops is actually a separate configurable
>>> (osd_recovery_max_active -- default to 15).  It's odd that with more
>>> backfilling on a single osd, there is more blocked IO.  Looking into
>>> that would be helpful and would probably give you some insight
>>> into recovery and the op processing pipeline.
>>
>> I'll see what I can find here.
>>
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.2.3
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWQQJ0CRDmVDuy+mK58QAAeeUP/1uN/9EdqQDJdxW7fgeJ
>> /E0X49LmnnCigMPL5QJ3fpGjf44C0xcc9LN5IGJwwumHd5ozznpocy8Oj30N
>> +rNPJQ4dxcRao+bXUL/+DCQuY0wN/i7CqfMTW5PFmkdH4K9Lgce+bN6Q5Ora
>> q8JZvAxaZLCLZ10N+uiD5ghs+3X68hu4Da8SYQj0vjLs5gV4oATebF3JuYXW
>> GZ9qNfm2ygbeuT5Q0fhOKrvwJ9taKagMNrZLU10Wz5lHpGNitP3f17sVQznF
>> 7ZCkZ+2oS+P4Lerchc3xB2qBJUoPJGSuGAUTSl/uUeyMoZT1+2LvLdNbJaio
>> UonoKJv47p4mpjo75x6FTWbJg0Ix+8/3/6oo3CkxC+6vOeWcv90B3TJGJPRz
>> tAayNB/1YpsVZ3QlHiuyC7+TdKofLRlMR21iAnAJkZ6FdgMz9SFk1Rp4vuyR
>> 1qeZ+B4qA0m9ZWjx/G80j3fkUDY48EHR5gnI1k+WHFAh8KqT3eTRr37n9HH4
>> 7wVakfPv89+HRjqrlA7WK5F89UVp1I+2kEmtPADCiwgh2wf0zn7Y5tA4FMXH
>> DIloZIRfvPwFtwpqgF7GR5vb/1dEOzD9Da0Zb7gBfsEfGaI2pJ+yvD1ad3BB
>> eqHQ05rl7s8meeX0H+6gWn9/f0JA65k2P2Y4N3YHk6OvKqIqnhreS9Tl4grH
>> MrBN
>> =Ju+O
>> -----END PGP SIGNATURE-----
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-11-10  0:39 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-11-04 16:54 Request for Comments: Weighted Round Robin OP Queue Robert LeBlanc
2015-11-04 19:49 ` Samuel Just
2015-11-05  3:00   ` Robert LeBlanc
2015-11-05  3:20     ` Gregory Farnum
2015-11-05 15:14       ` Robert LeBlanc
2015-11-05 15:16         ` Mark Nelson
2015-11-05 15:46         ` Gregory Farnum
2015-11-06 10:12         ` Sage Weil
2015-11-06 17:03           ` Robert LeBlanc
2015-11-06 17:16             ` Milosz Tanski
2015-11-07  1:39             ` Robert LeBlanc
2015-11-08 14:20               ` Sage Weil
2015-11-09 16:49                 ` Samuel Just
2015-11-09 17:19                   ` Robert LeBlanc
2015-11-09 18:19                     ` Samuel Just
2015-11-09 18:55                       ` Haomai Wang
2015-11-09 19:19                       ` Robert LeBlanc
2015-11-09 19:47                         ` Samuel Just
2015-11-09 20:31                           ` Robert LeBlanc
2015-11-09 20:49                             ` Samuel Just
2015-11-09 21:30                               ` Robert LeBlanc
2015-11-09 22:35                                 ` Samuel Just
2015-11-09 23:50                                   ` Robert LeBlanc
2015-11-10  0:39                               ` Milosz Tanski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.