Async reads, sync writes, op thread model discussion

All of lore.kernel.org
 help / color / mirror / Atom feed

* Async reads, sync writes, op thread model discussion
@ 2015-08-11 16:40 Samuel Just
  2015-08-11 20:05 ` Sage Weil
  0 siblings, 1 reply; 17+ messages in thread
From: Samuel Just @ 2015-08-11 16:40 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Currently, there are some deficiencies in how the OSD maps ops onto threads:

1. Reads are always syncronous limiting the queue depth seen from the device
   and therefore the possible parallelism.
2. Writes are always asyncronous forcing even very fast writes to be completed
   in a seperate thread.
3. do_op cannot surrender the thread/pg lock during an operation forcing reads
   required to continue the operation to be syncronous.

For spinning disks, this is mostly ok since they don't benefit as much from
large read queues, and writes (filestore with journal) are too slow for the
thread switches to make a big difference.  For very fast flash, however, we
want the flexibility to allow the backend to perform writes syncronously or
asyncronously when it makes sense, and to maintain a larger number of
outstanding reads than we have threads.  To that end, I suggest changing the
ObjectStore interface to be somewhat polling based:

/// Create new token
void *create_operation_token() = 0;
bool is_operation_complete(void *token) = 0;
bool is_operation_committed(void *token) = 0;
bool is_operation_applied(void *token) = 0;
void wait_for_committed(void *token) = 0;
void wait_for_applied(void *token) = 0;
void wait_for_complete(void *token) = 0;
/// Get result of operation
int get_result(void *token) = 0;
/// Must only be called once is_opearation_complete(token)
void reset_operation_token(void *token) = 0;
/// Must only be called once is_opearation_complete(token)
void detroy_operation_token(void *token) = 0;

/**
 * Queue a transaction
 *
 * token must be either fresh or reset since the last operation.
 * If the operation is completed syncronously, token can be resused
 * without calling reset_operation_token.
 *
 * @result 0 if completed syncronously, -EAGAIN if async
 */
int queue_transaction(
  Transaction *t,
  OpSequencer *osr,
  void *token
  ) = 0;

/**
 * Queue a transaction
 *
 * token must be either fresh or reset since the last operation.
 * If the operation is completed syncronously, token can be resused
 * without calling reset_operation_token.
 *
 * @result -EAGAIN if async, 0 or -error otherwise.
 */
int read(..., void *token) = 0;
...

The "token" concept here is opaque to allow the implementation some
flexibility.  Ideally, it would be nice to be able to include libaio
operation contexts directly.

The main goal here is for the backend to have the freedom to complete
writes and reads asyncronously or syncronously as the sitation warrants.
It also leaves the interface user in control of where the operation
completion is handled.  Each op thread can therefore handle its own
completions:

struct InProgressOp {
  PGRef pg;
  ObjectStore::Token *token;
  OpContext *ctx;
};
vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
for (auto op : in_progress) {
  op.token = objectstore->create_operation_token();
}

uint64_t next_to_start = 0;
uint64_t next_to_complete = 0;

while (1) {
  if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
    InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
    objectstore->wait_for_complete(op.token);
  }
  for (; next_to_complete < next_to_start; ++next_to_complete) {
    InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
    if (objectstore->is_operation_complete(op.token)) {
      PGRef pg = op.pg;
      OpContext *ctx = op.ctx;
      op.pg = PGRef();
      op.ctx = nullptr;
      objectstore->reset_operation_token(op.token);
      if (pg->continue_op(
            ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
              == -EAGAIN) {
        ++next_to_start;
        continue;
      }
    } else {
      break;
    }
  }
  pair<OpRequestRef, PGRef> dq = // get new request from queue;
  if (dq.second->do_op(
        dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
          == -EAGAIN) {
    ++next_to_start;
  }
}

A design like this would allow the op thread to move onto another task if the
objectstore implementation wants to perform an async operation.  For this
to work, there is some work to be done:

1. All current reads in the read and write paths (probably including the attr
   reads in get_object_context and friends) need to be able to handle getting
   -EAGAIN from the objectstore.
2. Writes and reads need to be able to handle having the pg lock dropped
   during the operation.  This should be ok since the actual object information
   is protected by the RWState locks.
3. OpContext needs to have enough information to pick up where the operation
   left off.  This suggests that we should obtain all required ObjectContexts
   at the beginning of the operation.  Cache/Tiering complicates this.
4. The object class interface will need to be replaced with a new interface
   based on possibly async reads.  We can maintain compatibility with the
   current ones by launching a new thread to handle any message which happens
   to contain an old-style object class operation.

Most of this needs to happen to support object class operations on ec pools
anyway.
-Sam

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
  2015-08-11 16:40 Async reads, sync writes, op thread model discussion Samuel Just
@ 2015-08-11 20:05 ` Sage Weil
  2015-08-11 22:19   ` Samuel Just
  0 siblings, 1 reply; 17+ messages in thread
From: Sage Weil @ 2015-08-11 20:05 UTC (permalink / raw)
  To: Samuel Just; +Cc: ceph-devel@vger.kernel.org

On Tue, 11 Aug 2015, Samuel Just wrote:
> Currently, there are some deficiencies in how the OSD maps ops onto threads:
> 
> 1. Reads are always syncronous limiting the queue depth seen from the device
>    and therefore the possible parallelism.
> 2. Writes are always asyncronous forcing even very fast writes to be completed
>    in a seperate thread.
> 3. do_op cannot surrender the thread/pg lock during an operation forcing reads
>    required to continue the operation to be syncronous.
> 
> For spinning disks, this is mostly ok since they don't benefit as much from
> large read queues, and writes (filestore with journal) are too slow for the
> thread switches to make a big difference.  For very fast flash, however, we
> want the flexibility to allow the backend to perform writes syncronously or
> asyncronously when it makes sense, and to maintain a larger number of
> outstanding reads than we have threads.  To that end, I suggest changing the
> ObjectStore interface to be somewhat polling based:
> 
> /// Create new token
> void *create_operation_token() = 0;
> bool is_operation_complete(void *token) = 0;
> bool is_operation_committed(void *token) = 0;
> bool is_operation_applied(void *token) = 0;
> void wait_for_committed(void *token) = 0;
> void wait_for_applied(void *token) = 0;
> void wait_for_complete(void *token) = 0;
> /// Get result of operation
> int get_result(void *token) = 0;
> /// Must only be called once is_opearation_complete(token)
> void reset_operation_token(void *token) = 0;
> /// Must only be called once is_opearation_complete(token)
> void detroy_operation_token(void *token) = 0;
> 
> /**
>  * Queue a transaction
>  *
>  * token must be either fresh or reset since the last operation.
>  * If the operation is completed syncronously, token can be resused
>  * without calling reset_operation_token.
>  *
>  * @result 0 if completed syncronously, -EAGAIN if async
>  */
> int queue_transaction(
>   Transaction *t,
>   OpSequencer *osr,
>   void *token
>   ) = 0;
> 
> /**
>  * Queue a transaction
>  *
>  * token must be either fresh or reset since the last operation.
>  * If the operation is completed syncronously, token can be resused
>  * without calling reset_operation_token.
>  *
>  * @result -EAGAIN if async, 0 or -error otherwise.
>  */
> int read(..., void *token) = 0;
> ...
> 
> The "token" concept here is opaque to allow the implementation some
> flexibility.  Ideally, it would be nice to be able to include libaio
> operation contexts directly.
> 
> The main goal here is for the backend to have the freedom to complete
> writes and reads asyncronously or syncronously as the sitation warrants.
> It also leaves the interface user in control of where the operation
> completion is handled.  Each op thread can therefore handle its own
> completions:
> 
> struct InProgressOp {
>   PGRef pg;
>   ObjectStore::Token *token;
>   OpContext *ctx;
> };
> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);

Probably a deque<> since we'll be pushign new requests and slurping off 
completed ones?  Or, we can make token not completely opaque, so that it 
includes a boost::intrusive::list node and can be strung on a user-managed 
queue.

> for (auto op : in_progress) {
>   op.token = objectstore->create_operation_token();
> }
> 
> uint64_t next_to_start = 0;
> uint64_t next_to_complete = 0;
> 
> while (1) {
>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>     objectstore->wait_for_complete(op.token);
>   }
>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>     if (objectstore->is_operation_complete(op.token)) {
>       PGRef pg = op.pg;
>       OpContext *ctx = op.ctx;
>       op.pg = PGRef();
>       op.ctx = nullptr;
>       objectstore->reset_operation_token(op.token);
>       if (pg->continue_op(
>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>               == -EAGAIN) {
>         ++next_to_start;
>         continue;
>       }
>     } else {
>       break;
>     }
>   }
>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>   if (dq.second->do_op(
>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>           == -EAGAIN) {
>     ++next_to_start;
>   }
> }
> 
> A design like this would allow the op thread to move onto another task if the
> objectstore implementation wants to perform an async operation.  For this
> to work, there is some work to be done:
> 
> 1. All current reads in the read and write paths (probably including the attr
>    reads in get_object_context and friends) need to be able to handle getting
>    -EAGAIN from the objectstore.

Can we leave the old read methods in place as blocking versions, and have 
them block on the token before returning?  That'll make the transition 
less painful.

> 2. Writes and reads need to be able to handle having the pg lock dropped
>    during the operation.  This should be ok since the actual object information
>    is protected by the RWState locks.

All of the async write pieces already handle this (recheck PG state after 
taking the lock).  If they don't get -EAGAIN they'd just call the next 
stage, probably with a flag indicating that validation can be skipped 
(since the lock hasn't been dropped)?

> 3. OpContext needs to have enough information to pick up where the operation
>    left off.  This suggests that we should obtain all required ObjectContexts
>    at the beginning of the operation.  Cache/Tiering complicates this.

Yeah...

> 4. The object class interface will need to be replaced with a new interface
>    based on possibly async reads.  We can maintain compatibility with the
>    current ones by launching a new thread to handle any message which happens
>    to contain an old-style object class operation.

Again, for now, wrappers would avoid this?

s
> 
> Most of this needs to happen to support object class operations on ec pools
> anyway.
> -Sam
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
  2015-08-11 20:05 ` Sage Weil
@ 2015-08-11 22:19   ` Samuel Just
  2015-08-11 22:34     ` Yehuda Sadeh-Weinraub
  0 siblings, 1 reply; 17+ messages in thread
From: Samuel Just @ 2015-08-11 22:19 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
to the actual interface I presented so much as the notion that the
next thing to do is restructure the OpWQ users as async state
machines.
-Sam

On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
> On Tue, 11 Aug 2015, Samuel Just wrote:
>> Currently, there are some deficiencies in how the OSD maps ops onto threads:
>>
>> 1. Reads are always syncronous limiting the queue depth seen from the device
>>    and therefore the possible parallelism.
>> 2. Writes are always asyncronous forcing even very fast writes to be completed
>>    in a seperate thread.
>> 3. do_op cannot surrender the thread/pg lock during an operation forcing reads
>>    required to continue the operation to be syncronous.
>>
>> For spinning disks, this is mostly ok since they don't benefit as much from
>> large read queues, and writes (filestore with journal) are too slow for the
>> thread switches to make a big difference.  For very fast flash, however, we
>> want the flexibility to allow the backend to perform writes syncronously or
>> asyncronously when it makes sense, and to maintain a larger number of
>> outstanding reads than we have threads.  To that end, I suggest changing the
>> ObjectStore interface to be somewhat polling based:
>>
>> /// Create new token
>> void *create_operation_token() = 0;
>> bool is_operation_complete(void *token) = 0;
>> bool is_operation_committed(void *token) = 0;
>> bool is_operation_applied(void *token) = 0;
>> void wait_for_committed(void *token) = 0;
>> void wait_for_applied(void *token) = 0;
>> void wait_for_complete(void *token) = 0;
>> /// Get result of operation
>> int get_result(void *token) = 0;
>> /// Must only be called once is_opearation_complete(token)
>> void reset_operation_token(void *token) = 0;
>> /// Must only be called once is_opearation_complete(token)
>> void detroy_operation_token(void *token) = 0;
>>
>> /**
>>  * Queue a transaction
>>  *
>>  * token must be either fresh or reset since the last operation.
>>  * If the operation is completed syncronously, token can be resused
>>  * without calling reset_operation_token.
>>  *
>>  * @result 0 if completed syncronously, -EAGAIN if async
>>  */
>> int queue_transaction(
>>   Transaction *t,
>>   OpSequencer *osr,
>>   void *token
>>   ) = 0;
>>
>> /**
>>  * Queue a transaction
>>  *
>>  * token must be either fresh or reset since the last operation.
>>  * If the operation is completed syncronously, token can be resused
>>  * without calling reset_operation_token.
>>  *
>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>  */
>> int read(..., void *token) = 0;
>> ...
>>
>> The "token" concept here is opaque to allow the implementation some
>> flexibility.  Ideally, it would be nice to be able to include libaio
>> operation contexts directly.
>>
>> The main goal here is for the backend to have the freedom to complete
>> writes and reads asyncronously or syncronously as the sitation warrants.
>> It also leaves the interface user in control of where the operation
>> completion is handled.  Each op thread can therefore handle its own
>> completions:
>>
>> struct InProgressOp {
>>   PGRef pg;
>>   ObjectStore::Token *token;
>>   OpContext *ctx;
>> };
>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>
> Probably a deque<> since we'll be pushign new requests and slurping off
> completed ones?  Or, we can make token not completely opaque, so that it
> includes a boost::intrusive::list node and can be strung on a user-managed
> queue.
>
>> for (auto op : in_progress) {
>>   op.token = objectstore->create_operation_token();
>> }
>>
>> uint64_t next_to_start = 0;
>> uint64_t next_to_complete = 0;
>>
>> while (1) {
>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>     objectstore->wait_for_complete(op.token);
>>   }
>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>     if (objectstore->is_operation_complete(op.token)) {
>>       PGRef pg = op.pg;
>>       OpContext *ctx = op.ctx;
>>       op.pg = PGRef();
>>       op.ctx = nullptr;
>>       objectstore->reset_operation_token(op.token);
>>       if (pg->continue_op(
>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>               == -EAGAIN) {
>>         ++next_to_start;
>>         continue;
>>       }
>>     } else {
>>       break;
>>     }
>>   }
>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>   if (dq.second->do_op(
>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>           == -EAGAIN) {
>>     ++next_to_start;
>>   }
>> }
>>
>> A design like this would allow the op thread to move onto another task if the
>> objectstore implementation wants to perform an async operation.  For this
>> to work, there is some work to be done:
>>
>> 1. All current reads in the read and write paths (probably including the attr
>>    reads in get_object_context and friends) need to be able to handle getting
>>    -EAGAIN from the objectstore.
>
> Can we leave the old read methods in place as blocking versions, and have
> them block on the token before returning?  That'll make the transition
> less painful.
>
>> 2. Writes and reads need to be able to handle having the pg lock dropped
>>    during the operation.  This should be ok since the actual object information
>>    is protected by the RWState locks.
>
> All of the async write pieces already handle this (recheck PG state after
> taking the lock).  If they don't get -EAGAIN they'd just call the next
> stage, probably with a flag indicating that validation can be skipped
> (since the lock hasn't been dropped)?
>
>> 3. OpContext needs to have enough information to pick up where the operation
>>    left off.  This suggests that we should obtain all required ObjectContexts
>>    at the beginning of the operation.  Cache/Tiering complicates this.
>
> Yeah...
>
>> 4. The object class interface will need to be replaced with a new interface
>>    based on possibly async reads.  We can maintain compatibility with the
>>    current ones by launching a new thread to handle any message which happens
>>    to contain an old-style object class operation.
>
> Again, for now, wrappers would avoid this?
>
> s
>>
>> Most of this needs to happen to support object class operations on ec pools
>> anyway.
>> -Sam
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
  2015-08-11 22:19   ` Samuel Just
@ 2015-08-11 22:34     ` Yehuda Sadeh-Weinraub
  2015-08-12  2:50       ` Haomai Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Yehuda Sadeh-Weinraub @ 2015-08-11 22:34 UTC (permalink / raw)
  To: Samuel Just; +Cc: Sage Weil, ceph-devel@vger.kernel.org

Already mentioned it on irc, adding to ceph-devel for the sake of
completeness. I did some infrastructure work for rgw and it seems (at
least to me) that it could at least be partially useful here.
Basically it's an async execution framework that utilizes coroutines.
It's comprised of aio notification manager that can also be tied into
coroutines execution. The coroutines themselves are stackless, they
are implemented as state machines, but using some boost trickery to
hide the details so they can be written very similar to blocking
methods. Coroutines can also execute other coroutines and can be
stacked, or can generate concurrent execution. It's still somewhat in
flux, but I think it's mostly done and already useful at this point,
so if there's anything you could use it might be a good idea to avoid
effort duplication.

Yehuda

On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com> wrote:
> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
> to the actual interface I presented so much as the notion that the
> next thing to do is restructure the OpWQ users as async state
> machines.
> -Sam
>
> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
>> On Tue, 11 Aug 2015, Samuel Just wrote:
>>> Currently, there are some deficiencies in how the OSD maps ops onto threads:
>>>
>>> 1. Reads are always syncronous limiting the queue depth seen from the device
>>>    and therefore the possible parallelism.
>>> 2. Writes are always asyncronous forcing even very fast writes to be completed
>>>    in a seperate thread.
>>> 3. do_op cannot surrender the thread/pg lock during an operation forcing reads
>>>    required to continue the operation to be syncronous.
>>>
>>> For spinning disks, this is mostly ok since they don't benefit as much from
>>> large read queues, and writes (filestore with journal) are too slow for the
>>> thread switches to make a big difference.  For very fast flash, however, we
>>> want the flexibility to allow the backend to perform writes syncronously or
>>> asyncronously when it makes sense, and to maintain a larger number of
>>> outstanding reads than we have threads.  To that end, I suggest changing the
>>> ObjectStore interface to be somewhat polling based:
>>>
>>> /// Create new token
>>> void *create_operation_token() = 0;
>>> bool is_operation_complete(void *token) = 0;
>>> bool is_operation_committed(void *token) = 0;
>>> bool is_operation_applied(void *token) = 0;
>>> void wait_for_committed(void *token) = 0;
>>> void wait_for_applied(void *token) = 0;
>>> void wait_for_complete(void *token) = 0;
>>> /// Get result of operation
>>> int get_result(void *token) = 0;
>>> /// Must only be called once is_opearation_complete(token)
>>> void reset_operation_token(void *token) = 0;
>>> /// Must only be called once is_opearation_complete(token)
>>> void detroy_operation_token(void *token) = 0;
>>>
>>> /**
>>>  * Queue a transaction
>>>  *
>>>  * token must be either fresh or reset since the last operation.
>>>  * If the operation is completed syncronously, token can be resused
>>>  * without calling reset_operation_token.
>>>  *
>>>  * @result 0 if completed syncronously, -EAGAIN if async
>>>  */
>>> int queue_transaction(
>>>   Transaction *t,
>>>   OpSequencer *osr,
>>>   void *token
>>>   ) = 0;
>>>
>>> /**
>>>  * Queue a transaction
>>>  *
>>>  * token must be either fresh or reset since the last operation.
>>>  * If the operation is completed syncronously, token can be resused
>>>  * without calling reset_operation_token.
>>>  *
>>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>>  */
>>> int read(..., void *token) = 0;
>>> ...
>>>
>>> The "token" concept here is opaque to allow the implementation some
>>> flexibility.  Ideally, it would be nice to be able to include libaio
>>> operation contexts directly.
>>>
>>> The main goal here is for the backend to have the freedom to complete
>>> writes and reads asyncronously or syncronously as the sitation warrants.
>>> It also leaves the interface user in control of where the operation
>>> completion is handled.  Each op thread can therefore handle its own
>>> completions:
>>>
>>> struct InProgressOp {
>>>   PGRef pg;
>>>   ObjectStore::Token *token;
>>>   OpContext *ctx;
>>> };
>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>>
>> Probably a deque<> since we'll be pushign new requests and slurping off
>> completed ones?  Or, we can make token not completely opaque, so that it
>> includes a boost::intrusive::list node and can be strung on a user-managed
>> queue.
>>
>>> for (auto op : in_progress) {
>>>   op.token = objectstore->create_operation_token();
>>> }
>>>
>>> uint64_t next_to_start = 0;
>>> uint64_t next_to_complete = 0;
>>>
>>> while (1) {
>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>     objectstore->wait_for_complete(op.token);
>>>   }
>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>     if (objectstore->is_operation_complete(op.token)) {
>>>       PGRef pg = op.pg;
>>>       OpContext *ctx = op.ctx;
>>>       op.pg = PGRef();
>>>       op.ctx = nullptr;
>>>       objectstore->reset_operation_token(op.token);
>>>       if (pg->continue_op(
>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>               == -EAGAIN) {
>>>         ++next_to_start;
>>>         continue;
>>>       }
>>>     } else {
>>>       break;
>>>     }
>>>   }
>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>>   if (dq.second->do_op(
>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>           == -EAGAIN) {
>>>     ++next_to_start;
>>>   }
>>> }
>>>
>>> A design like this would allow the op thread to move onto another task if the
>>> objectstore implementation wants to perform an async operation.  For this
>>> to work, there is some work to be done:
>>>
>>> 1. All current reads in the read and write paths (probably including the attr
>>>    reads in get_object_context and friends) need to be able to handle getting
>>>    -EAGAIN from the objectstore.
>>
>> Can we leave the old read methods in place as blocking versions, and have
>> them block on the token before returning?  That'll make the transition
>> less painful.
>>
>>> 2. Writes and reads need to be able to handle having the pg lock dropped
>>>    during the operation.  This should be ok since the actual object information
>>>    is protected by the RWState locks.
>>
>> All of the async write pieces already handle this (recheck PG state after
>> taking the lock).  If they don't get -EAGAIN they'd just call the next
>> stage, probably with a flag indicating that validation can be skipped
>> (since the lock hasn't been dropped)?
>>
>>> 3. OpContext needs to have enough information to pick up where the operation
>>>    left off.  This suggests that we should obtain all required ObjectContexts
>>>    at the beginning of the operation.  Cache/Tiering complicates this.
>>
>> Yeah...
>>
>>> 4. The object class interface will need to be replaced with a new interface
>>>    based on possibly async reads.  We can maintain compatibility with the
>>>    current ones by launching a new thread to handle any message which happens
>>>    to contain an old-style object class operation.
>>
>> Again, for now, wrappers would avoid this?
>>
>> s
>>>
>>> Most of this needs to happen to support object class operations on ec pools
>>> anyway.
>>> -Sam
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
  2015-08-11 22:34     ` Yehuda Sadeh-Weinraub
@ 2015-08-12  2:50       ` Haomai Wang
  2015-08-12  6:55         ` Somnath Roy
  2015-08-14 20:56         ` Milosz Tanski
  0 siblings, 2 replies; 17+ messages in thread
From: Haomai Wang @ 2015-08-12  2:50 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub; +Cc: Samuel Just, Sage Weil, ceph-devel@vger.kernel.org

On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
<ysadehwe@redhat.com> wrote:
> Already mentioned it on irc, adding to ceph-devel for the sake of
> completeness. I did some infrastructure work for rgw and it seems (at
> least to me) that it could at least be partially useful here.
> Basically it's an async execution framework that utilizes coroutines.
> It's comprised of aio notification manager that can also be tied into
> coroutines execution. The coroutines themselves are stackless, they
> are implemented as state machines, but using some boost trickery to
> hide the details so they can be written very similar to blocking
> methods. Coroutines can also execute other coroutines and can be
> stacked, or can generate concurrent execution. It's still somewhat in
> flux, but I think it's mostly done and already useful at this point,
> so if there's anything you could use it might be a good idea to avoid
> effort duplication.
>

coroutines like qemu is cool. The only thing I afraid is the
complicate of debug and it's really a big task :-(

I agree with sage that this design is really a new implementation for
objectstore so that it's harmful to existing objectstore impl. I also
suffer the pain from sync read xattr, we may add a async read
interface to solove this?

For context switch thing, now we have at least 3 cs for one op at osd
side. messenger -> op queue -> objectstore queue. I guess op queue ->
objectstore is easier to kick off just as sam said. We can make write
journal inline with queue_transaction, so the caller could directly
handle the transaction right now.

Anyway, I think we need to do some changes for this field.

> Yehuda
>
> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com> wrote:
>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
>> to the actual interface I presented so much as the notion that the
>> next thing to do is restructure the OpWQ users as async state
>> machines.
>> -Sam
>>
>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>>>> Currently, there are some deficiencies in how the OSD maps ops onto threads:
>>>>
>>>> 1. Reads are always syncronous limiting the queue depth seen from the device
>>>>    and therefore the possible parallelism.
>>>> 2. Writes are always asyncronous forcing even very fast writes to be completed
>>>>    in a seperate thread.
>>>> 3. do_op cannot surrender the thread/pg lock during an operation forcing reads
>>>>    required to continue the operation to be syncronous.
>>>>
>>>> For spinning disks, this is mostly ok since they don't benefit as much from
>>>> large read queues, and writes (filestore with journal) are too slow for the
>>>> thread switches to make a big difference.  For very fast flash, however, we
>>>> want the flexibility to allow the backend to perform writes syncronously or
>>>> asyncronously when it makes sense, and to maintain a larger number of
>>>> outstanding reads than we have threads.  To that end, I suggest changing the
>>>> ObjectStore interface to be somewhat polling based:
>>>>
>>>> /// Create new token
>>>> void *create_operation_token() = 0;
>>>> bool is_operation_complete(void *token) = 0;
>>>> bool is_operation_committed(void *token) = 0;
>>>> bool is_operation_applied(void *token) = 0;
>>>> void wait_for_committed(void *token) = 0;
>>>> void wait_for_applied(void *token) = 0;
>>>> void wait_for_complete(void *token) = 0;
>>>> /// Get result of operation
>>>> int get_result(void *token) = 0;
>>>> /// Must only be called once is_opearation_complete(token)
>>>> void reset_operation_token(void *token) = 0;
>>>> /// Must only be called once is_opearation_complete(token)
>>>> void detroy_operation_token(void *token) = 0;
>>>>
>>>> /**
>>>>  * Queue a transaction
>>>>  *
>>>>  * token must be either fresh or reset since the last operation.
>>>>  * If the operation is completed syncronously, token can be resused
>>>>  * without calling reset_operation_token.
>>>>  *
>>>>  * @result 0 if completed syncronously, -EAGAIN if async
>>>>  */
>>>> int queue_transaction(
>>>>   Transaction *t,
>>>>   OpSequencer *osr,
>>>>   void *token
>>>>   ) = 0;
>>>>
>>>> /**
>>>>  * Queue a transaction
>>>>  *
>>>>  * token must be either fresh or reset since the last operation.
>>>>  * If the operation is completed syncronously, token can be resused
>>>>  * without calling reset_operation_token.
>>>>  *
>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>>>  */
>>>> int read(..., void *token) = 0;
>>>> ...
>>>>
>>>> The "token" concept here is opaque to allow the implementation some
>>>> flexibility.  Ideally, it would be nice to be able to include libaio
>>>> operation contexts directly.
>>>>
>>>> The main goal here is for the backend to have the freedom to complete
>>>> writes and reads asyncronously or syncronously as the sitation warrants.
>>>> It also leaves the interface user in control of where the operation
>>>> completion is handled.  Each op thread can therefore handle its own
>>>> completions:
>>>>
>>>> struct InProgressOp {
>>>>   PGRef pg;
>>>>   ObjectStore::Token *token;
>>>>   OpContext *ctx;
>>>> };
>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>>>
>>> Probably a deque<> since we'll be pushign new requests and slurping off
>>> completed ones?  Or, we can make token not completely opaque, so that it
>>> includes a boost::intrusive::list node and can be strung on a user-managed
>>> queue.
>>>
>>>> for (auto op : in_progress) {
>>>>   op.token = objectstore->create_operation_token();
>>>> }
>>>>
>>>> uint64_t next_to_start = 0;
>>>> uint64_t next_to_complete = 0;
>>>>
>>>> while (1) {
>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>>     objectstore->wait_for_complete(op.token);
>>>>   }
>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>>     if (objectstore->is_operation_complete(op.token)) {
>>>>       PGRef pg = op.pg;
>>>>       OpContext *ctx = op.ctx;
>>>>       op.pg = PGRef();
>>>>       op.ctx = nullptr;
>>>>       objectstore->reset_operation_token(op.token);
>>>>       if (pg->continue_op(
>>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>>               == -EAGAIN) {
>>>>         ++next_to_start;
>>>>         continue;
>>>>       }
>>>>     } else {
>>>>       break;
>>>>     }
>>>>   }
>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>>>   if (dq.second->do_op(
>>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>>           == -EAGAIN) {
>>>>     ++next_to_start;
>>>>   }
>>>> }
>>>>
>>>> A design like this would allow the op thread to move onto another task if the
>>>> objectstore implementation wants to perform an async operation.  For this
>>>> to work, there is some work to be done:
>>>>
>>>> 1. All current reads in the read and write paths (probably including the attr
>>>>    reads in get_object_context and friends) need to be able to handle getting
>>>>    -EAGAIN from the objectstore.
>>>
>>> Can we leave the old read methods in place as blocking versions, and have
>>> them block on the token before returning?  That'll make the transition
>>> less painful.
>>>
>>>> 2. Writes and reads need to be able to handle having the pg lock dropped
>>>>    during the operation.  This should be ok since the actual object information
>>>>    is protected by the RWState locks.
>>>
>>> All of the async write pieces already handle this (recheck PG state after
>>> taking the lock).  If they don't get -EAGAIN they'd just call the next
>>> stage, probably with a flag indicating that validation can be skipped
>>> (since the lock hasn't been dropped)?
>>>
>>>> 3. OpContext needs to have enough information to pick up where the operation
>>>>    left off.  This suggests that we should obtain all required ObjectContexts
>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
>>>
>>> Yeah...
>>>
>>>> 4. The object class interface will need to be replaced with a new interface
>>>>    based on possibly async reads.  We can maintain compatibility with the
>>>>    current ones by launching a new thread to handle any message which happens
>>>>    to contain an old-style object class operation.
>>>
>>> Again, for now, wrappers would avoid this?
>>>
>>> s
>>>>
>>>> Most of this needs to happen to support object class operations on ec pools
>>>> anyway.
>>>> -Sam
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Async reads, sync writes, op thread model discussion
  2015-08-12  2:50       ` Haomai Wang
@ 2015-08-12  6:55         ` Somnath Roy
  2015-08-12  7:04           ` Haomai Wang
  2015-08-14 20:56         ` Milosz Tanski
  1 sibling, 1 reply; 17+ messages in thread
From: Somnath Roy @ 2015-08-12  6:55 UTC (permalink / raw)
  To: Haomai Wang, Yehuda Sadeh-Weinraub
  Cc: Samuel Just, Sage Weil, ceph-devel@vger.kernel.org

Haomai,
Yes, one of the goals is to make async read xattr..
IMO, this scheme should benefit in the following scenario..

Ops within a PG will not be serialized any more as long as it is not coming on the same object and this could be a big win.


In our workload at least we are not seeing the shard queue depth is going high indicating no bottleneck from workQ end. I am doubtful of having async completion in such case would be helpful.
Overall, I agree that this is the way to go forward...

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
Sent: Tuesday, August 11, 2015 7:50 PM
To: Yehuda Sadeh-Weinraub
Cc: Samuel Just; Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: Async reads, sync writes, op thread model discussion

On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub <ysadehwe@redhat.com> wrote:
> Already mentioned it on irc, adding to ceph-devel for the sake of
> completeness. I did some infrastructure work for rgw and it seems (at
> least to me) that it could at least be partially useful here.
> Basically it's an async execution framework that utilizes coroutines.
> It's comprised of aio notification manager that can also be tied into
> coroutines execution. The coroutines themselves are stackless, they
> are implemented as state machines, but using some boost trickery to
> hide the details so they can be written very similar to blocking
> methods. Coroutines can also execute other coroutines and can be
> stacked, or can generate concurrent execution. It's still somewhat in
> flux, but I think it's mostly done and already useful at this point,
> so if there's anything you could use it might be a good idea to avoid
> effort duplication.
>

coroutines like qemu is cool. The only thing I afraid is the complicate of debug and it's really a big task :-(

I agree with sage that this design is really a new implementation for objectstore so that it's harmful to existing objectstore impl. I also suffer the pain from sync read xattr, we may add a async read interface to solove this?

For context switch thing, now we have at least 3 cs for one op at osd side. messenger -> op queue -> objectstore queue. I guess op queue -> objectstore is easier to kick off just as sam said. We can make write journal inline with queue_transaction, so the caller could directly handle the transaction right now.

Anyway, I think we need to do some changes for this field.

> Yehuda
>
> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com> wrote:
>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
>> to the actual interface I presented so much as the notion that the
>> next thing to do is restructure the OpWQ users as async state
>> machines.
>> -Sam
>>
>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>>>> Currently, there are some deficiencies in how the OSD maps ops onto threads:
>>>>
>>>> 1. Reads are always syncronous limiting the queue depth seen from the device
>>>>    and therefore the possible parallelism.
>>>> 2. Writes are always asyncronous forcing even very fast writes to be completed
>>>>    in a seperate thread.
>>>> 3. do_op cannot surrender the thread/pg lock during an operation forcing reads
>>>>    required to continue the operation to be syncronous.
>>>>
>>>> For spinning disks, this is mostly ok since they don't benefit as
>>>> much from large read queues, and writes (filestore with journal)
>>>> are too slow for the thread switches to make a big difference.  For
>>>> very fast flash, however, we want the flexibility to allow the
>>>> backend to perform writes syncronously or asyncronously when it
>>>> makes sense, and to maintain a larger number of outstanding reads
>>>> than we have threads.  To that end, I suggest changing the ObjectStore interface to be somewhat polling based:
>>>>
>>>> /// Create new token
>>>> void *create_operation_token() = 0; bool is_operation_complete(void
>>>> *token) = 0; bool is_operation_committed(void *token) = 0; bool
>>>> is_operation_applied(void *token) = 0; void wait_for_committed(void
>>>> *token) = 0; void wait_for_applied(void *token) = 0; void
>>>> wait_for_complete(void *token) = 0; /// Get result of operation int
>>>> get_result(void *token) = 0; /// Must only be called once
>>>> is_opearation_complete(token) void reset_operation_token(void
>>>> *token) = 0; /// Must only be called once
>>>> is_opearation_complete(token) void detroy_operation_token(void
>>>> *token) = 0;
>>>>
>>>> /**
>>>>  * Queue a transaction
>>>>  *
>>>>  * token must be either fresh or reset since the last operation.
>>>>  * If the operation is completed syncronously, token can be resused
>>>>  * without calling reset_operation_token.
>>>>  *
>>>>  * @result 0 if completed syncronously, -EAGAIN if async  */ int
>>>> queue_transaction(
>>>>   Transaction *t,
>>>>   OpSequencer *osr,
>>>>   void *token
>>>>   ) = 0;
>>>>
>>>> /**
>>>>  * Queue a transaction
>>>>  *
>>>>  * token must be either fresh or reset since the last operation.
>>>>  * If the operation is completed syncronously, token can be resused
>>>>  * without calling reset_operation_token.
>>>>  *
>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>>>  */
>>>> int read(..., void *token) = 0;
>>>> ...
>>>>
>>>> The "token" concept here is opaque to allow the implementation some
>>>> flexibility.  Ideally, it would be nice to be able to include
>>>> libaio operation contexts directly.
>>>>
>>>> The main goal here is for the backend to have the freedom to
>>>> complete writes and reads asyncronously or syncronously as the sitation warrants.
>>>> It also leaves the interface user in control of where the operation
>>>> completion is handled.  Each op thread can therefore handle its own
>>>> completions:
>>>>
>>>> struct InProgressOp {
>>>>   PGRef pg;
>>>>   ObjectStore::Token *token;
>>>>   OpContext *ctx;
>>>> };
>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>>>
>>> Probably a deque<> since we'll be pushign new requests and slurping
>>> off completed ones?  Or, we can make token not completely opaque, so
>>> that it includes a boost::intrusive::list node and can be strung on
>>> a user-managed queue.
>>>
>>>> for (auto op : in_progress) {
>>>>   op.token = objectstore->create_operation_token();
>>>> }
>>>>
>>>> uint64_t next_to_start = 0;
>>>> uint64_t next_to_complete = 0;
>>>>
>>>> while (1) {
>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>>     objectstore->wait_for_complete(op.token);
>>>>   }
>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>>     if (objectstore->is_operation_complete(op.token)) {
>>>>       PGRef pg = op.pg;
>>>>       OpContext *ctx = op.ctx;
>>>>       op.pg = PGRef();
>>>>       op.ctx = nullptr;
>>>>       objectstore->reset_operation_token(op.token);
>>>>       if (pg->continue_op(
>>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>>               == -EAGAIN) {
>>>>         ++next_to_start;
>>>>         continue;
>>>>       }
>>>>     } else {
>>>>       break;
>>>>     }
>>>>   }
>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>>>   if (dq.second->do_op(
>>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>>           == -EAGAIN) {
>>>>     ++next_to_start;
>>>>   }
>>>> }
>>>>
>>>> A design like this would allow the op thread to move onto another
>>>> task if the objectstore implementation wants to perform an async
>>>> operation.  For this to work, there is some work to be done:
>>>>
>>>> 1. All current reads in the read and write paths (probably including the attr
>>>>    reads in get_object_context and friends) need to be able to handle getting
>>>>    -EAGAIN from the objectstore.
>>>
>>> Can we leave the old read methods in place as blocking versions, and
>>> have them block on the token before returning?  That'll make the
>>> transition less painful.
>>>
>>>> 2. Writes and reads need to be able to handle having the pg lock dropped
>>>>    during the operation.  This should be ok since the actual object information
>>>>    is protected by the RWState locks.
>>>
>>> All of the async write pieces already handle this (recheck PG state
>>> after taking the lock).  If they don't get -EAGAIN they'd just call
>>> the next stage, probably with a flag indicating that validation can
>>> be skipped (since the lock hasn't been dropped)?
>>>
>>>> 3. OpContext needs to have enough information to pick up where the operation
>>>>    left off.  This suggests that we should obtain all required ObjectContexts
>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
>>>
>>> Yeah...
>>>
>>>> 4. The object class interface will need to be replaced with a new interface
>>>>    based on possibly async reads.  We can maintain compatibility with the
>>>>    current ones by launching a new thread to handle any message which happens
>>>>    to contain an old-style object class operation.
>>>
>>> Again, for now, wrappers would avoid this?
>>>
>>> s
>>>>
>>>> Most of this needs to happen to support object class operations on
>>>> ec pools anyway.
>>>> -Sam
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at  http://vger.kernel.org/majordomo-info.html



--
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

________________________________

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
  2015-08-12  6:55         ` Somnath Roy
@ 2015-08-12  7:04           ` Haomai Wang
  0 siblings, 0 replies; 17+ messages in thread
From: Haomai Wang @ 2015-08-12  7:04 UTC (permalink / raw)
  To: Somnath Roy
  Cc: Yehuda Sadeh-Weinraub, Samuel Just, Sage Weil,
	ceph-devel@vger.kernel.org

On Wed, Aug 12, 2015 at 2:55 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Haomai,
> Yes, one of the goals is to make async read xattr..
> IMO, this scheme should benefit in the following scenario..
>
> Ops within a PG will not be serialized any more as long as it is not coming on the same object and this could be a big win.
>
>
> In our workload at least we are not seeing the shard queue depth is going high indicating no bottleneck from workQ end. I am doubtful of having async completion in such case would be helpful.
> Overall, I agree that this is the way to go forward...

Yes, although it doesn't hit high depth, it doesn't mean that io
latency isn't affect by sync xattr read :-). We could observe at least
100ms for deep depth inode reading. Like this
PR(https://github.com/ceph/ceph/pull/5497/files#diff-72747d40a424e7b5404366b557ff12a3),
it's suffered from the high latency from fd read.

So if we have async read, we can read object as early as possible?

The second, if we use fiemap read, we could issue multi async read op
and reap them. I guess it would benefit to rbd case. Although I'm not
sure how many people enable fiemap

>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
> Sent: Tuesday, August 11, 2015 7:50 PM
> To: Yehuda Sadeh-Weinraub
> Cc: Samuel Just; Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: Async reads, sync writes, op thread model discussion
>
> On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub <ysadehwe@redhat.com> wrote:
>> Already mentioned it on irc, adding to ceph-devel for the sake of
>> completeness. I did some infrastructure work for rgw and it seems (at
>> least to me) that it could at least be partially useful here.
>> Basically it's an async execution framework that utilizes coroutines.
>> It's comprised of aio notification manager that can also be tied into
>> coroutines execution. The coroutines themselves are stackless, they
>> are implemented as state machines, but using some boost trickery to
>> hide the details so they can be written very similar to blocking
>> methods. Coroutines can also execute other coroutines and can be
>> stacked, or can generate concurrent execution. It's still somewhat in
>> flux, but I think it's mostly done and already useful at this point,
>> so if there's anything you could use it might be a good idea to avoid
>> effort duplication.
>>
>
> coroutines like qemu is cool. The only thing I afraid is the complicate of debug and it's really a big task :-(
>
> I agree with sage that this design is really a new implementation for objectstore so that it's harmful to existing objectstore impl. I also suffer the pain from sync read xattr, we may add a async read interface to solove this?
>
> For context switch thing, now we have at least 3 cs for one op at osd side. messenger -> op queue -> objectstore queue. I guess op queue -> objectstore is easier to kick off just as sam said. We can make write journal inline with queue_transaction, so the caller could directly handle the transaction right now.
>
> Anyway, I think we need to do some changes for this field.
>
>> Yehuda
>>
>> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com> wrote:
>>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
>>> to the actual interface I presented so much as the notion that the
>>> next thing to do is restructure the OpWQ users as async state
>>> machines.
>>> -Sam
>>>
>>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
>>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>>>>> Currently, there are some deficiencies in how the OSD maps ops onto threads:
>>>>>
>>>>> 1. Reads are always syncronous limiting the queue depth seen from the device
>>>>>    and therefore the possible parallelism.
>>>>> 2. Writes are always asyncronous forcing even very fast writes to be completed
>>>>>    in a seperate thread.
>>>>> 3. do_op cannot surrender the thread/pg lock during an operation forcing reads
>>>>>    required to continue the operation to be syncronous.
>>>>>
>>>>> For spinning disks, this is mostly ok since they don't benefit as
>>>>> much from large read queues, and writes (filestore with journal)
>>>>> are too slow for the thread switches to make a big difference.  For
>>>>> very fast flash, however, we want the flexibility to allow the
>>>>> backend to perform writes syncronously or asyncronously when it
>>>>> makes sense, and to maintain a larger number of outstanding reads
>>>>> than we have threads.  To that end, I suggest changing the ObjectStore interface to be somewhat polling based:
>>>>>
>>>>> /// Create new token
>>>>> void *create_operation_token() = 0; bool is_operation_complete(void
>>>>> *token) = 0; bool is_operation_committed(void *token) = 0; bool
>>>>> is_operation_applied(void *token) = 0; void wait_for_committed(void
>>>>> *token) = 0; void wait_for_applied(void *token) = 0; void
>>>>> wait_for_complete(void *token) = 0; /// Get result of operation int
>>>>> get_result(void *token) = 0; /// Must only be called once
>>>>> is_opearation_complete(token) void reset_operation_token(void
>>>>> *token) = 0; /// Must only be called once
>>>>> is_opearation_complete(token) void detroy_operation_token(void
>>>>> *token) = 0;
>>>>>
>>>>> /**
>>>>>  * Queue a transaction
>>>>>  *
>>>>>  * token must be either fresh or reset since the last operation.
>>>>>  * If the operation is completed syncronously, token can be resused
>>>>>  * without calling reset_operation_token.
>>>>>  *
>>>>>  * @result 0 if completed syncronously, -EAGAIN if async  */ int
>>>>> queue_transaction(
>>>>>   Transaction *t,
>>>>>   OpSequencer *osr,
>>>>>   void *token
>>>>>   ) = 0;
>>>>>
>>>>> /**
>>>>>  * Queue a transaction
>>>>>  *
>>>>>  * token must be either fresh or reset since the last operation.
>>>>>  * If the operation is completed syncronously, token can be resused
>>>>>  * without calling reset_operation_token.
>>>>>  *
>>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>>>>  */
>>>>> int read(..., void *token) = 0;
>>>>> ...
>>>>>
>>>>> The "token" concept here is opaque to allow the implementation some
>>>>> flexibility.  Ideally, it would be nice to be able to include
>>>>> libaio operation contexts directly.
>>>>>
>>>>> The main goal here is for the backend to have the freedom to
>>>>> complete writes and reads asyncronously or syncronously as the sitation warrants.
>>>>> It also leaves the interface user in control of where the operation
>>>>> completion is handled.  Each op thread can therefore handle its own
>>>>> completions:
>>>>>
>>>>> struct InProgressOp {
>>>>>   PGRef pg;
>>>>>   ObjectStore::Token *token;
>>>>>   OpContext *ctx;
>>>>> };
>>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>>>>
>>>> Probably a deque<> since we'll be pushign new requests and slurping
>>>> off completed ones?  Or, we can make token not completely opaque, so
>>>> that it includes a boost::intrusive::list node and can be strung on
>>>> a user-managed queue.
>>>>
>>>>> for (auto op : in_progress) {
>>>>>   op.token = objectstore->create_operation_token();
>>>>> }
>>>>>
>>>>> uint64_t next_to_start = 0;
>>>>> uint64_t next_to_complete = 0;
>>>>>
>>>>> while (1) {
>>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>>>     objectstore->wait_for_complete(op.token);
>>>>>   }
>>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>>>     if (objectstore->is_operation_complete(op.token)) {
>>>>>       PGRef pg = op.pg;
>>>>>       OpContext *ctx = op.ctx;
>>>>>       op.pg = PGRef();
>>>>>       op.ctx = nullptr;
>>>>>       objectstore->reset_operation_token(op.token);
>>>>>       if (pg->continue_op(
>>>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>>>               == -EAGAIN) {
>>>>>         ++next_to_start;
>>>>>         continue;
>>>>>       }
>>>>>     } else {
>>>>>       break;
>>>>>     }
>>>>>   }
>>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>>>>   if (dq.second->do_op(
>>>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>>>           == -EAGAIN) {
>>>>>     ++next_to_start;
>>>>>   }
>>>>> }
>>>>>
>>>>> A design like this would allow the op thread to move onto another
>>>>> task if the objectstore implementation wants to perform an async
>>>>> operation.  For this to work, there is some work to be done:
>>>>>
>>>>> 1. All current reads in the read and write paths (probably including the attr
>>>>>    reads in get_object_context and friends) need to be able to handle getting
>>>>>    -EAGAIN from the objectstore.
>>>>
>>>> Can we leave the old read methods in place as blocking versions, and
>>>> have them block on the token before returning?  That'll make the
>>>> transition less painful.
>>>>
>>>>> 2. Writes and reads need to be able to handle having the pg lock dropped
>>>>>    during the operation.  This should be ok since the actual object information
>>>>>    is protected by the RWState locks.
>>>>
>>>> All of the async write pieces already handle this (recheck PG state
>>>> after taking the lock).  If they don't get -EAGAIN they'd just call
>>>> the next stage, probably with a flag indicating that validation can
>>>> be skipped (since the lock hasn't been dropped)?
>>>>
>>>>> 3. OpContext needs to have enough information to pick up where the operation
>>>>>    left off.  This suggests that we should obtain all required ObjectContexts
>>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
>>>>
>>>> Yeah...
>>>>
>>>>> 4. The object class interface will need to be replaced with a new interface
>>>>>    based on possibly async reads.  We can maintain compatibility with the
>>>>>    current ones by launching a new thread to handle any message which happens
>>>>>    to contain an old-style object class operation.
>>>>
>>>> Again, for now, wrappers would avoid this?
>>>>
>>>> s
>>>>>
>>>>> Most of this needs to happen to support object class operations on
>>>>> ec pools anyway.
>>>>> -Sam
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best Regards,
>
> Wheat
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> ________________________________
>
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
  2015-08-12  2:50       ` Haomai Wang
  2015-08-12  6:55         ` Somnath Roy
@ 2015-08-14 20:56         ` Milosz Tanski
  2015-08-14 21:19           ` Matt Benjamin
  1 sibling, 1 reply; 17+ messages in thread
From: Milosz Tanski @ 2015-08-14 20:56 UTC (permalink / raw)
  To: Haomai Wang
  Cc: Yehuda Sadeh-Weinraub, Samuel Just, Sage Weil,
	ceph-devel@vger.kernel.org

On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
> On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
> <ysadehwe@redhat.com> wrote:
>> Already mentioned it on irc, adding to ceph-devel for the sake of
>> completeness. I did some infrastructure work for rgw and it seems (at
>> least to me) that it could at least be partially useful here.
>> Basically it's an async execution framework that utilizes coroutines.
>> It's comprised of aio notification manager that can also be tied into
>> coroutines execution. The coroutines themselves are stackless, they
>> are implemented as state machines, but using some boost trickery to
>> hide the details so they can be written very similar to blocking
>> methods. Coroutines can also execute other coroutines and can be
>> stacked, or can generate concurrent execution. It's still somewhat in
>> flux, but I think it's mostly done and already useful at this point,
>> so if there's anything you could use it might be a good idea to avoid
>> effort duplication.
>>
>
> coroutines like qemu is cool. The only thing I afraid is the
> complicate of debug and it's really a big task :-(
>
> I agree with sage that this design is really a new implementation for
> objectstore so that it's harmful to existing objectstore impl. I also
> suffer the pain from sync read xattr, we may add a async read
> interface to solove this?
>
> For context switch thing, now we have at least 3 cs for one op at osd
> side. messenger -> op queue -> objectstore queue. I guess op queue ->
> objectstore is easier to kick off just as sam said. We can make write
> journal inline with queue_transaction, so the caller could directly
> handle the transaction right now.

I would caution agains coroutines (fibers) esp. in a multi-threaded
environment. Posix has officially obsoleted the swapcontext family of
functions in 1003.1-2004 and removed it in 1003.1-2008. That's because
they were notoriously non portable, and buggy. And yes you can use
something like boost::context / boost::coroutine instead but they also
have platform limitations. These implementations tend to abuse / turn
of various platform scrutiny features (like the one for
setjmp/longjmp). And on top of that many platforms don't consider
alternative context so you end up with obscure bugs. I've debugged my
fair share of bugs in Mordor coroutines with C++ exceptions, and errno
variables (since errno is really a function on linux and it's output a
pointer to threads errno is marked pure) if your coroutine migrates
threads. And you need to migrate them because of blocking and uneven
processor/thread distribution.

None of these are obstacles that can't be solved, but added together
they become a pretty long term liability. So I think long and hard
about it. Qemu doesn't have some of those issues because it's uses a
single thread and a much simpler C ABI that it deals with.

An alternative to coroutines that goes a long way towards solving the
callback spaghetti problem is futures/promises. I'm not talking of the
very future model that exists in C++11 library but more along the
lines that exist in other languages (like what's being done in
Javascript today). There's a good implementation of it Folly (the
facebook c++11 library). They have a very nice piece of documentation
here to understand how they work and how they differ.

That future model is very handy when dealing with the callback control
flow problem. You can chain a bunch of processing steps that requires
some async action, return a future and continue so on and so forth.
Also, it makes handling complex error cases easy by giving you a way
to skip lots of processing steps strait to onError at the end of the
chain.

Take a look at folly. Take a look at the expanded boost futures (they
call this future continuations:
http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.futures.then
). Also, building a cut down future framework just for Ceph (or
reduced set folly one) might be another option.

Just an alternative.

>
> Anyway, I think we need to do some changes for this field.
>
>> Yehuda
>>
>> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com> wrote:
>>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
>>> to the actual interface I presented so much as the notion that the
>>> next thing to do is restructure the OpWQ users as async state
>>> machines.
>>> -Sam
>>>
>>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
>>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>>>>> Currently, there are some deficiencies in how the OSD maps ops onto threads:
>>>>>
>>>>> 1. Reads are always syncronous limiting the queue depth seen from the device
>>>>>    and therefore the possible parallelism.
>>>>> 2. Writes are always asyncronous forcing even very fast writes to be completed
>>>>>    in a seperate thread.
>>>>> 3. do_op cannot surrender the thread/pg lock during an operation forcing reads
>>>>>    required to continue the operation to be syncronous.
>>>>>
>>>>> For spinning disks, this is mostly ok since they don't benefit as much from
>>>>> large read queues, and writes (filestore with journal) are too slow for the
>>>>> thread switches to make a big difference.  For very fast flash, however, we
>>>>> want the flexibility to allow the backend to perform writes syncronously or
>>>>> asyncronously when it makes sense, and to maintain a larger number of
>>>>> outstanding reads than we have threads.  To that end, I suggest changing the
>>>>> ObjectStore interface to be somewhat polling based:
>>>>>
>>>>> /// Create new token
>>>>> void *create_operation_token() = 0;
>>>>> bool is_operation_complete(void *token) = 0;
>>>>> bool is_operation_committed(void *token) = 0;
>>>>> bool is_operation_applied(void *token) = 0;
>>>>> void wait_for_committed(void *token) = 0;
>>>>> void wait_for_applied(void *token) = 0;
>>>>> void wait_for_complete(void *token) = 0;
>>>>> /// Get result of operation
>>>>> int get_result(void *token) = 0;
>>>>> /// Must only be called once is_opearation_complete(token)
>>>>> void reset_operation_token(void *token) = 0;
>>>>> /// Must only be called once is_opearation_complete(token)
>>>>> void detroy_operation_token(void *token) = 0;
>>>>>
>>>>> /**
>>>>>  * Queue a transaction
>>>>>  *
>>>>>  * token must be either fresh or reset since the last operation.
>>>>>  * If the operation is completed syncronously, token can be resused
>>>>>  * without calling reset_operation_token.
>>>>>  *
>>>>>  * @result 0 if completed syncronously, -EAGAIN if async
>>>>>  */
>>>>> int queue_transaction(
>>>>>   Transaction *t,
>>>>>   OpSequencer *osr,
>>>>>   void *token
>>>>>   ) = 0;
>>>>>
>>>>> /**
>>>>>  * Queue a transaction
>>>>>  *
>>>>>  * token must be either fresh or reset since the last operation.
>>>>>  * If the operation is completed syncronously, token can be resused
>>>>>  * without calling reset_operation_token.
>>>>>  *
>>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>>>>  */
>>>>> int read(..., void *token) = 0;
>>>>> ...
>>>>>
>>>>> The "token" concept here is opaque to allow the implementation some
>>>>> flexibility.  Ideally, it would be nice to be able to include libaio
>>>>> operation contexts directly.
>>>>>
>>>>> The main goal here is for the backend to have the freedom to complete
>>>>> writes and reads asyncronously or syncronously as the sitation warrants.
>>>>> It also leaves the interface user in control of where the operation
>>>>> completion is handled.  Each op thread can therefore handle its own
>>>>> completions:
>>>>>
>>>>> struct InProgressOp {
>>>>>   PGRef pg;
>>>>>   ObjectStore::Token *token;
>>>>>   OpContext *ctx;
>>>>> };
>>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>>>>
>>>> Probably a deque<> since we'll be pushign new requests and slurping off
>>>> completed ones?  Or, we can make token not completely opaque, so that it
>>>> includes a boost::intrusive::list node and can be strung on a user-managed
>>>> queue.
>>>>
>>>>> for (auto op : in_progress) {
>>>>>   op.token = objectstore->create_operation_token();
>>>>> }
>>>>>
>>>>> uint64_t next_to_start = 0;
>>>>> uint64_t next_to_complete = 0;
>>>>>
>>>>> while (1) {
>>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>>>     objectstore->wait_for_complete(op.token);
>>>>>   }
>>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>>>     if (objectstore->is_operation_complete(op.token)) {
>>>>>       PGRef pg = op.pg;
>>>>>       OpContext *ctx = op.ctx;
>>>>>       op.pg = PGRef();
>>>>>       op.ctx = nullptr;
>>>>>       objectstore->reset_operation_token(op.token);
>>>>>       if (pg->continue_op(
>>>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>>>               == -EAGAIN) {
>>>>>         ++next_to_start;
>>>>>         continue;
>>>>>       }
>>>>>     } else {
>>>>>       break;
>>>>>     }
>>>>>   }
>>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>>>>   if (dq.second->do_op(
>>>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>>>           == -EAGAIN) {
>>>>>     ++next_to_start;
>>>>>   }
>>>>> }
>>>>>
>>>>> A design like this would allow the op thread to move onto another task if the
>>>>> objectstore implementation wants to perform an async operation.  For this
>>>>> to work, there is some work to be done:
>>>>>
>>>>> 1. All current reads in the read and write paths (probably including the attr
>>>>>    reads in get_object_context and friends) need to be able to handle getting
>>>>>    -EAGAIN from the objectstore.
>>>>
>>>> Can we leave the old read methods in place as blocking versions, and have
>>>> them block on the token before returning?  That'll make the transition
>>>> less painful.
>>>>
>>>>> 2. Writes and reads need to be able to handle having the pg lock dropped
>>>>>    during the operation.  This should be ok since the actual object information
>>>>>    is protected by the RWState locks.
>>>>
>>>> All of the async write pieces already handle this (recheck PG state after
>>>> taking the lock).  If they don't get -EAGAIN they'd just call the next
>>>> stage, probably with a flag indicating that validation can be skipped
>>>> (since the lock hasn't been dropped)?
>>>>
>>>>> 3. OpContext needs to have enough information to pick up where the operation
>>>>>    left off.  This suggests that we should obtain all required ObjectContexts
>>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
>>>>
>>>> Yeah...
>>>>
>>>>> 4. The object class interface will need to be replaced with a new interface
>>>>>    based on possibly async reads.  We can maintain compatibility with the
>>>>>    current ones by launching a new thread to handle any message which happens
>>>>>    to contain an old-style object class operation.
>>>>
>>>> Again, for now, wrappers would avoid this?
>>>>
>>>> s
>>>>>
>>>>> Most of this needs to happen to support object class operations on ec pools
>>>>> anyway.
>>>>> -Sam
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best Regards,
>
> Wheat
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
  2015-08-14 20:56         ` Milosz Tanski
@ 2015-08-14 21:19           ` Matt Benjamin
  2015-08-14 21:39             ` Blinick, Stephen L
  2015-08-14 22:36             ` Milosz Tanski
  0 siblings, 2 replies; 17+ messages in thread
From: Matt Benjamin @ 2015-08-14 21:19 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: Haomai Wang, Yehuda Sadeh-Weinraub, Samuel Just, Sage Weil,
	ceph-devel

Hi,

I tend to agree with your comments regarding swapcontext/fibers.  I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either.

I like the way I interpreted Sam's design to be going, and in particular, that it seems to allow for consistent handling of read, write transactions.  I also would like to see how Yehuda's system works before arguing generalities.

My intuition is, since the goal is more deterministic performance in a short horizion, you

a. need to prioritize transparency over novel abstractions
b. need to build solid microbenchmarks that encapsulate small, then larger pieces of the work pipeline

My .05.

Matt

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-761-4689
fax.  734-769-8938
cel.  734-216-5309

----- Original Message -----
> From: "Milosz Tanski" <milosz@adfin.com>
> To: "Haomai Wang" <haomaiwang@gmail.com>
> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@redhat.com>, "Samuel Just" <sjust@redhat.com>, "Sage Weil" <sage@newdream.net>,
> ceph-devel@vger.kernel.org
> Sent: Friday, August 14, 2015 4:56:26 PM
> Subject: Re: Async reads, sync writes, op thread model discussion
> 
> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
> > <ysadehwe@redhat.com> wrote:
> >> Already mentioned it on irc, adding to ceph-devel for the sake of
> >> completeness. I did some infrastructure work for rgw and it seems (at
> >> least to me) that it could at least be partially useful here.
> >> Basically it's an async execution framework that utilizes coroutines.
> >> It's comprised of aio notification manager that can also be tied into
> >> coroutines execution. The coroutines themselves are stackless, they
> >> are implemented as state machines, but using some boost trickery to
> >> hide the details so they can be written very similar to blocking
> >> methods. Coroutines can also execute other coroutines and can be
> >> stacked, or can generate concurrent execution. It's still somewhat in
> >> flux, but I think it's mostly done and already useful at this point,
> >> so if there's anything you could use it might be a good idea to avoid
> >> effort duplication.
> >>
> >
> > coroutines like qemu is cool. The only thing I afraid is the
> > complicate of debug and it's really a big task :-(
> >
> > I agree with sage that this design is really a new implementation for
> > objectstore so that it's harmful to existing objectstore impl. I also
> > suffer the pain from sync read xattr, we may add a async read
> > interface to solove this?
> >
> > For context switch thing, now we have at least 3 cs for one op at osd
> > side. messenger -> op queue -> objectstore queue. I guess op queue ->
> > objectstore is easier to kick off just as sam said. We can make write
> > journal inline with queue_transaction, so the caller could directly
> > handle the transaction right now.
> 
> I would caution agains coroutines (fibers) esp. in a multi-threaded
> environment. Posix has officially obsoleted the swapcontext family of
> functions in 1003.1-2004 and removed it in 1003.1-2008. That's because
> they were notoriously non portable, and buggy. And yes you can use
> something like boost::context / boost::coroutine instead but they also
> have platform limitations. These implementations tend to abuse / turn
> of various platform scrutiny features (like the one for
> setjmp/longjmp). And on top of that many platforms don't consider
> alternative context so you end up with obscure bugs. I've debugged my
> fair share of bugs in Mordor coroutines with C++ exceptions, and errno
> variables (since errno is really a function on linux and it's output a
> pointer to threads errno is marked pure) if your coroutine migrates
> threads. And you need to migrate them because of blocking and uneven
> processor/thread distribution.
> 
> None of these are obstacles that can't be solved, but added together
> they become a pretty long term liability. So I think long and hard
> about it. Qemu doesn't have some of those issues because it's uses a
> single thread and a much simpler C ABI that it deals with.
> 
> An alternative to coroutines that goes a long way towards solving the
> callback spaghetti problem is futures/promises. I'm not talking of the
> very future model that exists in C++11 library but more along the
> lines that exist in other languages (like what's being done in
> Javascript today). There's a good implementation of it Folly (the
> facebook c++11 library). They have a very nice piece of documentation
> here to understand how they work and how they differ.
> 
> That future model is very handy when dealing with the callback control
> flow problem. You can chain a bunch of processing steps that requires
> some async action, return a future and continue so on and so forth.
> Also, it makes handling complex error cases easy by giving you a way
> to skip lots of processing steps strait to onError at the end of the
> chain.
> 
> Take a look at folly. Take a look at the expanded boost futures (they
> call this future continuations:
> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.futures.then
> ). Also, building a cut down future framework just for Ceph (or
> reduced set folly one) might be another option.
> 
> Just an alternative.
> 
> >
> > Anyway, I think we need to do some changes for this field.
> >
> >> Yehuda
> >>
> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com> wrote:
> >>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
> >>> to the actual interface I presented so much as the notion that the
> >>> next thing to do is restructure the OpWQ users as async state
> >>> machines.
> >>> -Sam
> >>>
> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
> >>>>> Currently, there are some deficiencies in how the OSD maps ops onto
> >>>>> threads:
> >>>>>
> >>>>> 1. Reads are always syncronous limiting the queue depth seen from the
> >>>>> device
> >>>>>    and therefore the possible parallelism.
> >>>>> 2. Writes are always asyncronous forcing even very fast writes to be
> >>>>> completed
> >>>>>    in a seperate thread.
> >>>>> 3. do_op cannot surrender the thread/pg lock during an operation
> >>>>> forcing reads
> >>>>>    required to continue the operation to be syncronous.
> >>>>>
> >>>>> For spinning disks, this is mostly ok since they don't benefit as much
> >>>>> from
> >>>>> large read queues, and writes (filestore with journal) are too slow for
> >>>>> the
> >>>>> thread switches to make a big difference.  For very fast flash,
> >>>>> however, we
> >>>>> want the flexibility to allow the backend to perform writes
> >>>>> syncronously or
> >>>>> asyncronously when it makes sense, and to maintain a larger number of
> >>>>> outstanding reads than we have threads.  To that end, I suggest
> >>>>> changing the
> >>>>> ObjectStore interface to be somewhat polling based:
> >>>>>
> >>>>> /// Create new token
> >>>>> void *create_operation_token() = 0;
> >>>>> bool is_operation_complete(void *token) = 0;
> >>>>> bool is_operation_committed(void *token) = 0;
> >>>>> bool is_operation_applied(void *token) = 0;
> >>>>> void wait_for_committed(void *token) = 0;
> >>>>> void wait_for_applied(void *token) = 0;
> >>>>> void wait_for_complete(void *token) = 0;
> >>>>> /// Get result of operation
> >>>>> int get_result(void *token) = 0;
> >>>>> /// Must only be called once is_opearation_complete(token)
> >>>>> void reset_operation_token(void *token) = 0;
> >>>>> /// Must only be called once is_opearation_complete(token)
> >>>>> void detroy_operation_token(void *token) = 0;
> >>>>>
> >>>>> /**
> >>>>>  * Queue a transaction
> >>>>>  *
> >>>>>  * token must be either fresh or reset since the last operation.
> >>>>>  * If the operation is completed syncronously, token can be resused
> >>>>>  * without calling reset_operation_token.
> >>>>>  *
> >>>>>  * @result 0 if completed syncronously, -EAGAIN if async
> >>>>>  */
> >>>>> int queue_transaction(
> >>>>>   Transaction *t,
> >>>>>   OpSequencer *osr,
> >>>>>   void *token
> >>>>>   ) = 0;
> >>>>>
> >>>>> /**
> >>>>>  * Queue a transaction
> >>>>>  *
> >>>>>  * token must be either fresh or reset since the last operation.
> >>>>>  * If the operation is completed syncronously, token can be resused
> >>>>>  * without calling reset_operation_token.
> >>>>>  *
> >>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
> >>>>>  */
> >>>>> int read(..., void *token) = 0;
> >>>>> ...
> >>>>>
> >>>>> The "token" concept here is opaque to allow the implementation some
> >>>>> flexibility.  Ideally, it would be nice to be able to include libaio
> >>>>> operation contexts directly.
> >>>>>
> >>>>> The main goal here is for the backend to have the freedom to complete
> >>>>> writes and reads asyncronously or syncronously as the sitation
> >>>>> warrants.
> >>>>> It also leaves the interface user in control of where the operation
> >>>>> completion is handled.  Each op thread can therefore handle its own
> >>>>> completions:
> >>>>>
> >>>>> struct InProgressOp {
> >>>>>   PGRef pg;
> >>>>>   ObjectStore::Token *token;
> >>>>>   OpContext *ctx;
> >>>>> };
> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
> >>>>
> >>>> Probably a deque<> since we'll be pushign new requests and slurping off
> >>>> completed ones?  Or, we can make token not completely opaque, so that it
> >>>> includes a boost::intrusive::list node and can be strung on a
> >>>> user-managed
> >>>> queue.
> >>>>
> >>>>> for (auto op : in_progress) {
> >>>>>   op.token = objectstore->create_operation_token();
> >>>>> }
> >>>>>
> >>>>> uint64_t next_to_start = 0;
> >>>>> uint64_t next_to_complete = 0;
> >>>>>
> >>>>> while (1) {
> >>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
> >>>>>     objectstore->wait_for_complete(op.token);
> >>>>>   }
> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
> >>>>>     if (objectstore->is_operation_complete(op.token)) {
> >>>>>       PGRef pg = op.pg;
> >>>>>       OpContext *ctx = op.ctx;
> >>>>>       op.pg = PGRef();
> >>>>>       op.ctx = nullptr;
> >>>>>       objectstore->reset_operation_token(op.token);
> >>>>>       if (pg->continue_op(
> >>>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
> >>>>>               == -EAGAIN) {
> >>>>>         ++next_to_start;
> >>>>>         continue;
> >>>>>       }
> >>>>>     } else {
> >>>>>       break;
> >>>>>     }
> >>>>>   }
> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
> >>>>>   if (dq.second->do_op(
> >>>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
> >>>>>           == -EAGAIN) {
> >>>>>     ++next_to_start;
> >>>>>   }
> >>>>> }
> >>>>>
> >>>>> A design like this would allow the op thread to move onto another task
> >>>>> if the
> >>>>> objectstore implementation wants to perform an async operation.  For
> >>>>> this
> >>>>> to work, there is some work to be done:
> >>>>>
> >>>>> 1. All current reads in the read and write paths (probably including
> >>>>> the attr
> >>>>>    reads in get_object_context and friends) need to be able to handle
> >>>>>    getting
> >>>>>    -EAGAIN from the objectstore.
> >>>>
> >>>> Can we leave the old read methods in place as blocking versions, and
> >>>> have
> >>>> them block on the token before returning?  That'll make the transition
> >>>> less painful.
> >>>>
> >>>>> 2. Writes and reads need to be able to handle having the pg lock
> >>>>> dropped
> >>>>>    during the operation.  This should be ok since the actual object
> >>>>>    information
> >>>>>    is protected by the RWState locks.
> >>>>
> >>>> All of the async write pieces already handle this (recheck PG state
> >>>> after
> >>>> taking the lock).  If they don't get -EAGAIN they'd just call the next
> >>>> stage, probably with a flag indicating that validation can be skipped
> >>>> (since the lock hasn't been dropped)?
> >>>>
> >>>>> 3. OpContext needs to have enough information to pick up where the
> >>>>> operation
> >>>>>    left off.  This suggests that we should obtain all required
> >>>>>    ObjectContexts
> >>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
> >>>>
> >>>> Yeah...
> >>>>
> >>>>> 4. The object class interface will need to be replaced with a new
> >>>>> interface
> >>>>>    based on possibly async reads.  We can maintain compatibility with
> >>>>>    the
> >>>>>    current ones by launching a new thread to handle any message which
> >>>>>    happens
> >>>>>    to contain an old-style object class operation.
> >>>>
> >>>> Again, for now, wrappers would avoid this?
> >>>>
> >>>> s
> >>>>>
> >>>>> Most of this needs to happen to support object class operations on ec
> >>>>> pools
> >>>>> anyway.
> >>>>> -Sam
> >>>>> --
> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>>> in
> >>>>> the body of a message to majordomo@vger.kernel.org
> >>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>
> >>>>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > Best Regards,
> >
> > Wheat
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
> 
> p: 646-253-9055
> e: milosz@adfin.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Async reads, sync writes, op thread model discussion
  2015-08-14 21:19           ` Matt Benjamin
@ 2015-08-14 21:39             ` Blinick, Stephen L
  2015-08-14 22:36             ` Milosz Tanski
  1 sibling, 0 replies; 17+ messages in thread
From: Blinick, Stephen L @ 2015-08-14 21:39 UTC (permalink / raw)
  To: Matt Benjamin, Milosz Tanski
  Cc: Haomai Wang, Yehuda Sadeh-Weinraub, Samuel Just, Sage Weil,
	ceph-devel@vger.kernel.org

Based on distant past experience (in more of an embedded system) using co-routines, I'd vote for perhaps not implementing them right away, as it becomes complex to follow and debug.    As I understood it, each worker would have an additional queue to pull from: new incoming work and re-dispatching completed operations. In each case there's a limited set of states the operations will be in as they are pre-empted at specific points.  Also having discrete queues for new work vs pending operations can allow balancing between the two which may be necessary.

Either way I agree that deterministic operation (as well as a short per IO code path length for sync operations) would be the best outcome.

Thanks,

Stephen


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Matt Benjamin
Sent: Friday, August 14, 2015 2:19 PM
To: Milosz Tanski
Cc: Haomai Wang; Yehuda Sadeh-Weinraub; Samuel Just; Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: Async reads, sync writes, op thread model discussion

Hi,

I tend to agree with your comments regarding swapcontext/fibers.  I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either.

I like the way I interpreted Sam's design to be going, and in particular, that it seems to allow for consistent handling of read, write transactions.  I also would like to see how Yehuda's system works before arguing generalities.

My intuition is, since the goal is more deterministic performance in a short horizion, you

a. need to prioritize transparency over novel abstractions b. need to build solid microbenchmarks that encapsulate small, then larger pieces of the work pipeline

My .05.

Matt

--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-761-4689
fax.  734-769-8938
cel.  734-216-5309

----- Original Message -----
> From: "Milosz Tanski" <milosz@adfin.com>
> To: "Haomai Wang" <haomaiwang@gmail.com>
> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@redhat.com>, "Samuel Just" 
> <sjust@redhat.com>, "Sage Weil" <sage@newdream.net>, 
> ceph-devel@vger.kernel.org
> Sent: Friday, August 14, 2015 4:56:26 PM
> Subject: Re: Async reads, sync writes, op thread model discussion
> 
> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub 
> > <ysadehwe@redhat.com> wrote:
> >> Already mentioned it on irc, adding to ceph-devel for the sake of 
> >> completeness. I did some infrastructure work for rgw and it seems 
> >> (at least to me) that it could at least be partially useful here.
> >> Basically it's an async execution framework that utilizes coroutines.
> >> It's comprised of aio notification manager that can also be tied 
> >> into coroutines execution. The coroutines themselves are stackless, 
> >> they are implemented as state machines, but using some boost 
> >> trickery to hide the details so they can be written very similar to 
> >> blocking methods. Coroutines can also execute other coroutines and 
> >> can be stacked, or can generate concurrent execution. It's still 
> >> somewhat in flux, but I think it's mostly done and already useful 
> >> at this point, so if there's anything you could use it might be a 
> >> good idea to avoid effort duplication.
> >>
> >
> > coroutines like qemu is cool. The only thing I afraid is the 
> > complicate of debug and it's really a big task :-(
> >
> > I agree with sage that this design is really a new implementation 
> > for objectstore so that it's harmful to existing objectstore impl. I 
> > also suffer the pain from sync read xattr, we may add a async read 
> > interface to solove this?
> >
> > For context switch thing, now we have at least 3 cs for one op at 
> > osd side. messenger -> op queue -> objectstore queue. I guess op 
> > queue -> objectstore is easier to kick off just as sam said. We can 
> > make write journal inline with queue_transaction, so the caller 
> > could directly handle the transaction right now.
> 
> I would caution agains coroutines (fibers) esp. in a multi-threaded 
> environment. Posix has officially obsoleted the swapcontext family of 
> functions in 1003.1-2004 and removed it in 1003.1-2008. That's because 
> they were notoriously non portable, and buggy. And yes you can use 
> something like boost::context / boost::coroutine instead but they also 
> have platform limitations. These implementations tend to abuse / turn 
> of various platform scrutiny features (like the one for 
> setjmp/longjmp). And on top of that many platforms don't consider 
> alternative context so you end up with obscure bugs. I've debugged my 
> fair share of bugs in Mordor coroutines with C++ exceptions, and errno 
> variables (since errno is really a function on linux and it's output a 
> pointer to threads errno is marked pure) if your coroutine migrates 
> threads. And you need to migrate them because of blocking and uneven 
> processor/thread distribution.
> 
> None of these are obstacles that can't be solved, but added together 
> they become a pretty long term liability. So I think long and hard 
> about it. Qemu doesn't have some of those issues because it's uses a 
> single thread and a much simpler C ABI that it deals with.
> 
> An alternative to coroutines that goes a long way towards solving the 
> callback spaghetti problem is futures/promises. I'm not talking of the 
> very future model that exists in C++11 library but more along the 
> lines that exist in other languages (like what's being done in 
> Javascript today). There's a good implementation of it Folly (the 
> facebook c++11 library). They have a very nice piece of documentation 
> here to understand how they work and how they differ.
> 
> That future model is very handy when dealing with the callback control 
> flow problem. You can chain a bunch of processing steps that requires 
> some async action, return a future and continue so on and so forth.
> Also, it makes handling complex error cases easy by giving you a way 
> to skip lots of processing steps strait to onError at the end of the 
> chain.
> 
> Take a look at folly. Take a look at the expanded boost futures (they 
> call this future continuations:
> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.h
> tml#thread.synchronization.futures.then
> ). Also, building a cut down future framework just for Ceph (or 
> reduced set folly one) might be another option.
> 
> Just an alternative.
> 
> >
> > Anyway, I think we need to do some changes for this field.
> >
> >> Yehuda
> >>
> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com> wrote:
> >>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all 
> >>> tied to the actual interface I presented so much as the notion 
> >>> that the next thing to do is restructure the OpWQ users as async 
> >>> state machines.
> >>> -Sam
> >>>
> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
> >>>>> Currently, there are some deficiencies in how the OSD maps ops 
> >>>>> onto
> >>>>> threads:
> >>>>>
> >>>>> 1. Reads are always syncronous limiting the queue depth seen 
> >>>>> from the device
> >>>>>    and therefore the possible parallelism.
> >>>>> 2. Writes are always asyncronous forcing even very fast writes 
> >>>>> to be completed
> >>>>>    in a seperate thread.
> >>>>> 3. do_op cannot surrender the thread/pg lock during an operation 
> >>>>> forcing reads
> >>>>>    required to continue the operation to be syncronous.
> >>>>>
> >>>>> For spinning disks, this is mostly ok since they don't benefit 
> >>>>> as much from large read queues, and writes (filestore with 
> >>>>> journal) are too slow for the thread switches to make a big 
> >>>>> difference.  For very fast flash, however, we want the 
> >>>>> flexibility to allow the backend to perform writes syncronously 
> >>>>> or asyncronously when it makes sense, and to maintain a larger 
> >>>>> number of outstanding reads than we have threads.  To that end, 
> >>>>> I suggest changing the ObjectStore interface to be somewhat 
> >>>>> polling based:
> >>>>>
> >>>>> /// Create new token
> >>>>> void *create_operation_token() = 0; bool 
> >>>>> is_operation_complete(void *token) = 0; bool 
> >>>>> is_operation_committed(void *token) = 0; bool 
> >>>>> is_operation_applied(void *token) = 0; void 
> >>>>> wait_for_committed(void *token) = 0; void wait_for_applied(void 
> >>>>> *token) = 0; void wait_for_complete(void *token) = 0; /// Get 
> >>>>> result of operation int get_result(void *token) = 0; /// Must 
> >>>>> only be called once is_opearation_complete(token) void 
> >>>>> reset_operation_token(void *token) = 0; /// Must only be called 
> >>>>> once is_opearation_complete(token) void 
> >>>>> detroy_operation_token(void *token) = 0;
> >>>>>
> >>>>> /**
> >>>>>  * Queue a transaction
> >>>>>  *
> >>>>>  * token must be either fresh or reset since the last operation.
> >>>>>  * If the operation is completed syncronously, token can be 
> >>>>> resused
> >>>>>  * without calling reset_operation_token.
> >>>>>  *
> >>>>>  * @result 0 if completed syncronously, -EAGAIN if async  */ int 
> >>>>> queue_transaction(
> >>>>>   Transaction *t,
> >>>>>   OpSequencer *osr,
> >>>>>   void *token
> >>>>>   ) = 0;
> >>>>>
> >>>>> /**
> >>>>>  * Queue a transaction
> >>>>>  *
> >>>>>  * token must be either fresh or reset since the last operation.
> >>>>>  * If the operation is completed syncronously, token can be 
> >>>>> resused
> >>>>>  * without calling reset_operation_token.
> >>>>>  *
> >>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
> >>>>>  */
> >>>>> int read(..., void *token) = 0;
> >>>>> ...
> >>>>>
> >>>>> The "token" concept here is opaque to allow the implementation 
> >>>>> some flexibility.  Ideally, it would be nice to be able to 
> >>>>> include libaio operation contexts directly.
> >>>>>
> >>>>> The main goal here is for the backend to have the freedom to 
> >>>>> complete writes and reads asyncronously or syncronously as the 
> >>>>> sitation warrants.
> >>>>> It also leaves the interface user in control of where the 
> >>>>> operation completion is handled.  Each op thread can therefore 
> >>>>> handle its own
> >>>>> completions:
> >>>>>
> >>>>> struct InProgressOp {
> >>>>>   PGRef pg;
> >>>>>   ObjectStore::Token *token;
> >>>>>   OpContext *ctx;
> >>>>> };
> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
> >>>>
> >>>> Probably a deque<> since we'll be pushign new requests and 
> >>>> slurping off completed ones?  Or, we can make token not 
> >>>> completely opaque, so that it includes a boost::intrusive::list 
> >>>> node and can be strung on a user-managed queue.
> >>>>
> >>>>> for (auto op : in_progress) {
> >>>>>   op.token = objectstore->create_operation_token();
> >>>>> }
> >>>>>
> >>>>> uint64_t next_to_start = 0;
> >>>>> uint64_t next_to_complete = 0;
> >>>>>
> >>>>> while (1) {
> >>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
> >>>>>     objectstore->wait_for_complete(op.token);
> >>>>>   }
> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
> >>>>>     if (objectstore->is_operation_complete(op.token)) {
> >>>>>       PGRef pg = op.pg;
> >>>>>       OpContext *ctx = op.ctx;
> >>>>>       op.pg = PGRef();
> >>>>>       op.ctx = nullptr;
> >>>>>       objectstore->reset_operation_token(op.token);
> >>>>>       if (pg->continue_op(
> >>>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
> >>>>>               == -EAGAIN) {
> >>>>>         ++next_to_start;
> >>>>>         continue;
> >>>>>       }
> >>>>>     } else {
> >>>>>       break;
> >>>>>     }
> >>>>>   }
> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
> >>>>>   if (dq.second->do_op(
> >>>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
> >>>>>           == -EAGAIN) {
> >>>>>     ++next_to_start;
> >>>>>   }
> >>>>> }
> >>>>>
> >>>>> A design like this would allow the op thread to move onto 
> >>>>> another task if the objectstore implementation wants to perform 
> >>>>> an async operation.  For this to work, there is some work to be 
> >>>>> done:
> >>>>>
> >>>>> 1. All current reads in the read and write paths (probably 
> >>>>> including the attr
> >>>>>    reads in get_object_context and friends) need to be able to handle
> >>>>>    getting
> >>>>>    -EAGAIN from the objectstore.
> >>>>
> >>>> Can we leave the old read methods in place as blocking versions, 
> >>>> and have them block on the token before returning?  That'll make 
> >>>> the transition less painful.
> >>>>
> >>>>> 2. Writes and reads need to be able to handle having the pg lock 
> >>>>> dropped
> >>>>>    during the operation.  This should be ok since the actual object
> >>>>>    information
> >>>>>    is protected by the RWState locks.
> >>>>
> >>>> All of the async write pieces already handle this (recheck PG 
> >>>> state after taking the lock).  If they don't get -EAGAIN they'd 
> >>>> just call the next stage, probably with a flag indicating that 
> >>>> validation can be skipped (since the lock hasn't been dropped)?
> >>>>
> >>>>> 3. OpContext needs to have enough information to pick up where 
> >>>>> the operation
> >>>>>    left off.  This suggests that we should obtain all required
> >>>>>    ObjectContexts
> >>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
> >>>>
> >>>> Yeah...
> >>>>
> >>>>> 4. The object class interface will need to be replaced with a 
> >>>>> new interface
> >>>>>    based on possibly async reads.  We can maintain compatibility with
> >>>>>    the
> >>>>>    current ones by launching a new thread to handle any message which
> >>>>>    happens
> >>>>>    to contain an old-style object class operation.
> >>>>
> >>>> Again, for now, wrappers would avoid this?
> >>>>
> >>>> s
> >>>>>
> >>>>> Most of this needs to happen to support object class operations 
> >>>>> on ec pools anyway.
> >>>>> -Sam
> >>>>> --
> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>>> in
> >>>>> the body of a message to majordomo@vger.kernel.org More 
> >>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>
> >>>>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe 
> >>> ceph-devel" in the body of a message to majordomo@vger.kernel.org 
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe 
> >> ceph-devel" in the body of a message to majordomo@vger.kernel.org 
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > Best Regards,
> >
> > Wheat
> > --
> > To unsubscribe from this list: send the line "unsubscribe 
> > ceph-devel" in the body of a message to majordomo@vger.kernel.org 
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
> 
> p: 646-253-9055
> e: milosz@adfin.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
  2015-08-14 21:19           ` Matt Benjamin
  2015-08-14 21:39             ` Blinick, Stephen L
@ 2015-08-14 22:36             ` Milosz Tanski
  2015-08-27 23:21               ` Samuel Just
  1 sibling, 1 reply; 17+ messages in thread
From: Milosz Tanski @ 2015-08-14 22:36 UTC (permalink / raw)
  To: Matt Benjamin
  Cc: Haomai Wang, Yehuda Sadeh-Weinraub, Samuel Just, Sage Weil,
	ceph-devel

On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin <mbenjamin@redhat.com> wrote:
> Hi,
>
> I tend to agree with your comments regarding swapcontext/fibers.  I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either.

Not suggesting the libraries/frameworks. Just brining up promises as
an alternative technique to coroutines. Dealing with spaghetti
evented/callback code gets old after doing it for 10+ years. Then
throw in blocking IO.

And FYI, the data flow promises go back in comp sci back to the 80s.

Cheers,
- Milosz

>
> I like the way I interpreted Sam's design to be going, and in particular, that it seems to allow for consistent handling of read, write transactions.  I also would like to see how Yehuda's system works before arguing generalities.
>
> My intuition is, since the goal is more deterministic performance in a short horizion, you
>
> a. need to prioritize transparency over novel abstractions
> b. need to build solid microbenchmarks that encapsulate small, then larger pieces of the work pipeline
>
> My .05.
>
> Matt
>
> --
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel.  734-761-4689
> fax.  734-769-8938
> cel.  734-216-5309
>
> ----- Original Message -----
>> From: "Milosz Tanski" <milosz@adfin.com>
>> To: "Haomai Wang" <haomaiwang@gmail.com>
>> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@redhat.com>, "Samuel Just" <sjust@redhat.com>, "Sage Weil" <sage@newdream.net>,
>> ceph-devel@vger.kernel.org
>> Sent: Friday, August 14, 2015 4:56:26 PM
>> Subject: Re: Async reads, sync writes, op thread model discussion
>>
>> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
>> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
>> > <ysadehwe@redhat.com> wrote:
>> >> Already mentioned it on irc, adding to ceph-devel for the sake of
>> >> completeness. I did some infrastructure work for rgw and it seems (at
>> >> least to me) that it could at least be partially useful here.
>> >> Basically it's an async execution framework that utilizes coroutines.
>> >> It's comprised of aio notification manager that can also be tied into
>> >> coroutines execution. The coroutines themselves are stackless, they
>> >> are implemented as state machines, but using some boost trickery to
>> >> hide the details so they can be written very similar to blocking
>> >> methods. Coroutines can also execute other coroutines and can be
>> >> stacked, or can generate concurrent execution. It's still somewhat in
>> >> flux, but I think it's mostly done and already useful at this point,
>> >> so if there's anything you could use it might be a good idea to avoid
>> >> effort duplication.
>> >>
>> >
>> > coroutines like qemu is cool. The only thing I afraid is the
>> > complicate of debug and it's really a big task :-(
>> >
>> > I agree with sage that this design is really a new implementation for
>> > objectstore so that it's harmful to existing objectstore impl. I also
>> > suffer the pain from sync read xattr, we may add a async read
>> > interface to solove this?
>> >
>> > For context switch thing, now we have at least 3 cs for one op at osd
>> > side. messenger -> op queue -> objectstore queue. I guess op queue ->
>> > objectstore is easier to kick off just as sam said. We can make write
>> > journal inline with queue_transaction, so the caller could directly
>> > handle the transaction right now.
>>
>> I would caution agains coroutines (fibers) esp. in a multi-threaded
>> environment. Posix has officially obsoleted the swapcontext family of
>> functions in 1003.1-2004 and removed it in 1003.1-2008. That's because
>> they were notoriously non portable, and buggy. And yes you can use
>> something like boost::context / boost::coroutine instead but they also
>> have platform limitations. These implementations tend to abuse / turn
>> of various platform scrutiny features (like the one for
>> setjmp/longjmp). And on top of that many platforms don't consider
>> alternative context so you end up with obscure bugs. I've debugged my
>> fair share of bugs in Mordor coroutines with C++ exceptions, and errno
>> variables (since errno is really a function on linux and it's output a
>> pointer to threads errno is marked pure) if your coroutine migrates
>> threads. And you need to migrate them because of blocking and uneven
>> processor/thread distribution.
>>
>> None of these are obstacles that can't be solved, but added together
>> they become a pretty long term liability. So I think long and hard
>> about it. Qemu doesn't have some of those issues because it's uses a
>> single thread and a much simpler C ABI that it deals with.
>>
>> An alternative to coroutines that goes a long way towards solving the
>> callback spaghetti problem is futures/promises. I'm not talking of the
>> very future model that exists in C++11 library but more along the
>> lines that exist in other languages (like what's being done in
>> Javascript today). There's a good implementation of it Folly (the
>> facebook c++11 library). They have a very nice piece of documentation
>> here to understand how they work and how they differ.
>>
>> That future model is very handy when dealing with the callback control
>> flow problem. You can chain a bunch of processing steps that requires
>> some async action, return a future and continue so on and so forth.
>> Also, it makes handling complex error cases easy by giving you a way
>> to skip lots of processing steps strait to onError at the end of the
>> chain.
>>
>> Take a look at folly. Take a look at the expanded boost futures (they
>> call this future continuations:
>> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.futures.then
>> ). Also, building a cut down future framework just for Ceph (or
>> reduced set folly one) might be another option.
>>
>> Just an alternative.
>>
>> >
>> > Anyway, I think we need to do some changes for this field.
>> >
>> >> Yehuda
>> >>
>> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com> wrote:
>> >>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
>> >>> to the actual interface I presented so much as the notion that the
>> >>> next thing to do is restructure the OpWQ users as async state
>> >>> machines.
>> >>> -Sam
>> >>>
>> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
>> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>> >>>>> Currently, there are some deficiencies in how the OSD maps ops onto
>> >>>>> threads:
>> >>>>>
>> >>>>> 1. Reads are always syncronous limiting the queue depth seen from the
>> >>>>> device
>> >>>>>    and therefore the possible parallelism.
>> >>>>> 2. Writes are always asyncronous forcing even very fast writes to be
>> >>>>> completed
>> >>>>>    in a seperate thread.
>> >>>>> 3. do_op cannot surrender the thread/pg lock during an operation
>> >>>>> forcing reads
>> >>>>>    required to continue the operation to be syncronous.
>> >>>>>
>> >>>>> For spinning disks, this is mostly ok since they don't benefit as much
>> >>>>> from
>> >>>>> large read queues, and writes (filestore with journal) are too slow for
>> >>>>> the
>> >>>>> thread switches to make a big difference.  For very fast flash,
>> >>>>> however, we
>> >>>>> want the flexibility to allow the backend to perform writes
>> >>>>> syncronously or
>> >>>>> asyncronously when it makes sense, and to maintain a larger number of
>> >>>>> outstanding reads than we have threads.  To that end, I suggest
>> >>>>> changing the
>> >>>>> ObjectStore interface to be somewhat polling based:
>> >>>>>
>> >>>>> /// Create new token
>> >>>>> void *create_operation_token() = 0;
>> >>>>> bool is_operation_complete(void *token) = 0;
>> >>>>> bool is_operation_committed(void *token) = 0;
>> >>>>> bool is_operation_applied(void *token) = 0;
>> >>>>> void wait_for_committed(void *token) = 0;
>> >>>>> void wait_for_applied(void *token) = 0;
>> >>>>> void wait_for_complete(void *token) = 0;
>> >>>>> /// Get result of operation
>> >>>>> int get_result(void *token) = 0;
>> >>>>> /// Must only be called once is_opearation_complete(token)
>> >>>>> void reset_operation_token(void *token) = 0;
>> >>>>> /// Must only be called once is_opearation_complete(token)
>> >>>>> void detroy_operation_token(void *token) = 0;
>> >>>>>
>> >>>>> /**
>> >>>>>  * Queue a transaction
>> >>>>>  *
>> >>>>>  * token must be either fresh or reset since the last operation.
>> >>>>>  * If the operation is completed syncronously, token can be resused
>> >>>>>  * without calling reset_operation_token.
>> >>>>>  *
>> >>>>>  * @result 0 if completed syncronously, -EAGAIN if async
>> >>>>>  */
>> >>>>> int queue_transaction(
>> >>>>>   Transaction *t,
>> >>>>>   OpSequencer *osr,
>> >>>>>   void *token
>> >>>>>   ) = 0;
>> >>>>>
>> >>>>> /**
>> >>>>>  * Queue a transaction
>> >>>>>  *
>> >>>>>  * token must be either fresh or reset since the last operation.
>> >>>>>  * If the operation is completed syncronously, token can be resused
>> >>>>>  * without calling reset_operation_token.
>> >>>>>  *
>> >>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
>> >>>>>  */
>> >>>>> int read(..., void *token) = 0;
>> >>>>> ...
>> >>>>>
>> >>>>> The "token" concept here is opaque to allow the implementation some
>> >>>>> flexibility.  Ideally, it would be nice to be able to include libaio
>> >>>>> operation contexts directly.
>> >>>>>
>> >>>>> The main goal here is for the backend to have the freedom to complete
>> >>>>> writes and reads asyncronously or syncronously as the sitation
>> >>>>> warrants.
>> >>>>> It also leaves the interface user in control of where the operation
>> >>>>> completion is handled.  Each op thread can therefore handle its own
>> >>>>> completions:
>> >>>>>
>> >>>>> struct InProgressOp {
>> >>>>>   PGRef pg;
>> >>>>>   ObjectStore::Token *token;
>> >>>>>   OpContext *ctx;
>> >>>>> };
>> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>> >>>>
>> >>>> Probably a deque<> since we'll be pushign new requests and slurping off
>> >>>> completed ones?  Or, we can make token not completely opaque, so that it
>> >>>> includes a boost::intrusive::list node and can be strung on a
>> >>>> user-managed
>> >>>> queue.
>> >>>>
>> >>>>> for (auto op : in_progress) {
>> >>>>>   op.token = objectstore->create_operation_token();
>> >>>>> }
>> >>>>>
>> >>>>> uint64_t next_to_start = 0;
>> >>>>> uint64_t next_to_complete = 0;
>> >>>>>
>> >>>>> while (1) {
>> >>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>> >>>>>     objectstore->wait_for_complete(op.token);
>> >>>>>   }
>> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>> >>>>>     if (objectstore->is_operation_complete(op.token)) {
>> >>>>>       PGRef pg = op.pg;
>> >>>>>       OpContext *ctx = op.ctx;
>> >>>>>       op.pg = PGRef();
>> >>>>>       op.ctx = nullptr;
>> >>>>>       objectstore->reset_operation_token(op.token);
>> >>>>>       if (pg->continue_op(
>> >>>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>> >>>>>               == -EAGAIN) {
>> >>>>>         ++next_to_start;
>> >>>>>         continue;
>> >>>>>       }
>> >>>>>     } else {
>> >>>>>       break;
>> >>>>>     }
>> >>>>>   }
>> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>> >>>>>   if (dq.second->do_op(
>> >>>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>> >>>>>           == -EAGAIN) {
>> >>>>>     ++next_to_start;
>> >>>>>   }
>> >>>>> }
>> >>>>>
>> >>>>> A design like this would allow the op thread to move onto another task
>> >>>>> if the
>> >>>>> objectstore implementation wants to perform an async operation.  For
>> >>>>> this
>> >>>>> to work, there is some work to be done:
>> >>>>>
>> >>>>> 1. All current reads in the read and write paths (probably including
>> >>>>> the attr
>> >>>>>    reads in get_object_context and friends) need to be able to handle
>> >>>>>    getting
>> >>>>>    -EAGAIN from the objectstore.
>> >>>>
>> >>>> Can we leave the old read methods in place as blocking versions, and
>> >>>> have
>> >>>> them block on the token before returning?  That'll make the transition
>> >>>> less painful.
>> >>>>
>> >>>>> 2. Writes and reads need to be able to handle having the pg lock
>> >>>>> dropped
>> >>>>>    during the operation.  This should be ok since the actual object
>> >>>>>    information
>> >>>>>    is protected by the RWState locks.
>> >>>>
>> >>>> All of the async write pieces already handle this (recheck PG state
>> >>>> after
>> >>>> taking the lock).  If they don't get -EAGAIN they'd just call the next
>> >>>> stage, probably with a flag indicating that validation can be skipped
>> >>>> (since the lock hasn't been dropped)?
>> >>>>
>> >>>>> 3. OpContext needs to have enough information to pick up where the
>> >>>>> operation
>> >>>>>    left off.  This suggests that we should obtain all required
>> >>>>>    ObjectContexts
>> >>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
>> >>>>
>> >>>> Yeah...
>> >>>>
>> >>>>> 4. The object class interface will need to be replaced with a new
>> >>>>> interface
>> >>>>>    based on possibly async reads.  We can maintain compatibility with
>> >>>>>    the
>> >>>>>    current ones by launching a new thread to handle any message which
>> >>>>>    happens
>> >>>>>    to contain an old-style object class operation.
>> >>>>
>> >>>> Again, for now, wrappers would avoid this?
>> >>>>
>> >>>> s
>> >>>>>
>> >>>>> Most of this needs to happen to support object class operations on ec
>> >>>>> pools
>> >>>>> anyway.
>> >>>>> -Sam
>> >>>>> --
>> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> >>>>> in
>> >>>>> the body of a message to majordomo@vger.kernel.org
>> >>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>>>>
>> >>>>>
>> >>> --
>> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >>> the body of a message to majordomo@vger.kernel.org
>> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>> >
>> >
>> > --
>> > Best Regards,
>> >
>> > Wheat
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Milosz Tanski
>> CTO
>> 16 East 34th Street, 15th floor
>> New York, NY 10016
>>
>> p: 646-253-9055
>> e: milosz@adfin.com
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
  2015-08-14 22:36             ` Milosz Tanski
@ 2015-08-27 23:21               ` Samuel Just
  2015-08-28 20:01                 ` Blinick, Stephen L
  0 siblings, 1 reply; 17+ messages in thread
From: Samuel Just @ 2015-08-27 23:21 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: Matt Benjamin, Haomai Wang, Yehuda Sadeh-Weinraub, Sage Weil,
	ceph-devel

It's been a couple of weeks, so I thought I'd send out a short
progress update.  I've started by trying to nail down enough of the
threading design/async interface to start refactoring do_op.  For the
moment, I've backtracked on the token approach mostly because it
seemed more complicated than necessary.  I'm thinking we'll keep a
callback like mechanism, but move responsibility for queuing and
execution back to the interface user by allowing the user to pass a
completion queue and an uninterpreted completion pointer.  These two
commits have the gist of the direction I'm going in (the actual code
is more a place holder today).  An OSDReactor instance will replace
each of the "shards" in the current sharded work queue.  Any aio
initiated by a pg operation from a reactor will pass that reactor's
queue, ensuring that the completion winds up back in the same thread.
Writes would work pretty much the same way, but with two callbacks.

My plan is to flesh this out to the point where the OSD works again,
and then refactor the osd write path to use this mechanism for basic
rbd writes.  That should be enough to let us evaluate whether this is
a good path forward for async writes.  Async reads may be a bit tricky
to evaluate.  It seems like we'd need hardware that needs that kind of
queue depth and an objectstore implementation which can exploit it.
I'll wire up filestore to do async reads optionally for testing
purposes, but it's not clear to me that there will be cases where
filestore would want to do an async read rather than a sync read.

https://github.com/athanatos/ceph/commit/642b7190d70a5970534b911f929e6e3885bf99c4
https://github.com/athanatos/ceph/commit/42bee815081a91abd003bf7170ef1270f23222f6
-Sam

On Fri, Aug 14, 2015 at 3:36 PM, Milosz Tanski <milosz@adfin.com> wrote:
> On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin <mbenjamin@redhat.com> wrote:
>> Hi,
>>
>> I tend to agree with your comments regarding swapcontext/fibers.  I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either.
>
> Not suggesting the libraries/frameworks. Just brining up promises as
> an alternative technique to coroutines. Dealing with spaghetti
> evented/callback code gets old after doing it for 10+ years. Then
> throw in blocking IO.
>
> And FYI, the data flow promises go back in comp sci back to the 80s.
>
> Cheers,
> - Milosz
>
>>
>> I like the way I interpreted Sam's design to be going, and in particular, that it seems to allow for consistent handling of read, write transactions.  I also would like to see how Yehuda's system works before arguing generalities.
>>
>> My intuition is, since the goal is more deterministic performance in a short horizion, you
>>
>> a. need to prioritize transparency over novel abstractions
>> b. need to build solid microbenchmarks that encapsulate small, then larger pieces of the work pipeline
>>
>> My .05.
>>
>> Matt
>>
>> --
>> Matt Benjamin
>> Red Hat, Inc.
>> 315 West Huron Street, Suite 140A
>> Ann Arbor, Michigan 48103
>>
>> http://www.redhat.com/en/technologies/storage
>>
>> tel.  734-761-4689
>> fax.  734-769-8938
>> cel.  734-216-5309
>>
>> ----- Original Message -----
>>> From: "Milosz Tanski" <milosz@adfin.com>
>>> To: "Haomai Wang" <haomaiwang@gmail.com>
>>> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@redhat.com>, "Samuel Just" <sjust@redhat.com>, "Sage Weil" <sage@newdream.net>,
>>> ceph-devel@vger.kernel.org
>>> Sent: Friday, August 14, 2015 4:56:26 PM
>>> Subject: Re: Async reads, sync writes, op thread model discussion
>>>
>>> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
>>> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
>>> > <ysadehwe@redhat.com> wrote:
>>> >> Already mentioned it on irc, adding to ceph-devel for the sake of
>>> >> completeness. I did some infrastructure work for rgw and it seems (at
>>> >> least to me) that it could at least be partially useful here.
>>> >> Basically it's an async execution framework that utilizes coroutines.
>>> >> It's comprised of aio notification manager that can also be tied into
>>> >> coroutines execution. The coroutines themselves are stackless, they
>>> >> are implemented as state machines, but using some boost trickery to
>>> >> hide the details so they can be written very similar to blocking
>>> >> methods. Coroutines can also execute other coroutines and can be
>>> >> stacked, or can generate concurrent execution. It's still somewhat in
>>> >> flux, but I think it's mostly done and already useful at this point,
>>> >> so if there's anything you could use it might be a good idea to avoid
>>> >> effort duplication.
>>> >>
>>> >
>>> > coroutines like qemu is cool. The only thing I afraid is the
>>> > complicate of debug and it's really a big task :-(
>>> >
>>> > I agree with sage that this design is really a new implementation for
>>> > objectstore so that it's harmful to existing objectstore impl. I also
>>> > suffer the pain from sync read xattr, we may add a async read
>>> > interface to solove this?
>>> >
>>> > For context switch thing, now we have at least 3 cs for one op at osd
>>> > side. messenger -> op queue -> objectstore queue. I guess op queue ->
>>> > objectstore is easier to kick off just as sam said. We can make write
>>> > journal inline with queue_transaction, so the caller could directly
>>> > handle the transaction right now.
>>>
>>> I would caution agains coroutines (fibers) esp. in a multi-threaded
>>> environment. Posix has officially obsoleted the swapcontext family of
>>> functions in 1003.1-2004 and removed it in 1003.1-2008. That's because
>>> they were notoriously non portable, and buggy. And yes you can use
>>> something like boost::context / boost::coroutine instead but they also
>>> have platform limitations. These implementations tend to abuse / turn
>>> of various platform scrutiny features (like the one for
>>> setjmp/longjmp). And on top of that many platforms don't consider
>>> alternative context so you end up with obscure bugs. I've debugged my
>>> fair share of bugs in Mordor coroutines with C++ exceptions, and errno
>>> variables (since errno is really a function on linux and it's output a
>>> pointer to threads errno is marked pure) if your coroutine migrates
>>> threads. And you need to migrate them because of blocking and uneven
>>> processor/thread distribution.
>>>
>>> None of these are obstacles that can't be solved, but added together
>>> they become a pretty long term liability. So I think long and hard
>>> about it. Qemu doesn't have some of those issues because it's uses a
>>> single thread and a much simpler C ABI that it deals with.
>>>
>>> An alternative to coroutines that goes a long way towards solving the
>>> callback spaghetti problem is futures/promises. I'm not talking of the
>>> very future model that exists in C++11 library but more along the
>>> lines that exist in other languages (like what's being done in
>>> Javascript today). There's a good implementation of it Folly (the
>>> facebook c++11 library). They have a very nice piece of documentation
>>> here to understand how they work and how they differ.
>>>
>>> That future model is very handy when dealing with the callback control
>>> flow problem. You can chain a bunch of processing steps that requires
>>> some async action, return a future and continue so on and so forth.
>>> Also, it makes handling complex error cases easy by giving you a way
>>> to skip lots of processing steps strait to onError at the end of the
>>> chain.
>>>
>>> Take a look at folly. Take a look at the expanded boost futures (they
>>> call this future continuations:
>>> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.futures.then
>>> ). Also, building a cut down future framework just for Ceph (or
>>> reduced set folly one) might be another option.
>>>
>>> Just an alternative.
>>>
>>> >
>>> > Anyway, I think we need to do some changes for this field.
>>> >
>>> >> Yehuda
>>> >>
>>> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com> wrote:
>>> >>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
>>> >>> to the actual interface I presented so much as the notion that the
>>> >>> next thing to do is restructure the OpWQ users as async state
>>> >>> machines.
>>> >>> -Sam
>>> >>>
>>> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
>>> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>>> >>>>> Currently, there are some deficiencies in how the OSD maps ops onto
>>> >>>>> threads:
>>> >>>>>
>>> >>>>> 1. Reads are always syncronous limiting the queue depth seen from the
>>> >>>>> device
>>> >>>>>    and therefore the possible parallelism.
>>> >>>>> 2. Writes are always asyncronous forcing even very fast writes to be
>>> >>>>> completed
>>> >>>>>    in a seperate thread.
>>> >>>>> 3. do_op cannot surrender the thread/pg lock during an operation
>>> >>>>> forcing reads
>>> >>>>>    required to continue the operation to be syncronous.
>>> >>>>>
>>> >>>>> For spinning disks, this is mostly ok since they don't benefit as much
>>> >>>>> from
>>> >>>>> large read queues, and writes (filestore with journal) are too slow for
>>> >>>>> the
>>> >>>>> thread switches to make a big difference.  For very fast flash,
>>> >>>>> however, we
>>> >>>>> want the flexibility to allow the backend to perform writes
>>> >>>>> syncronously or
>>> >>>>> asyncronously when it makes sense, and to maintain a larger number of
>>> >>>>> outstanding reads than we have threads.  To that end, I suggest
>>> >>>>> changing the
>>> >>>>> ObjectStore interface to be somewhat polling based:
>>> >>>>>
>>> >>>>> /// Create new token
>>> >>>>> void *create_operation_token() = 0;
>>> >>>>> bool is_operation_complete(void *token) = 0;
>>> >>>>> bool is_operation_committed(void *token) = 0;
>>> >>>>> bool is_operation_applied(void *token) = 0;
>>> >>>>> void wait_for_committed(void *token) = 0;
>>> >>>>> void wait_for_applied(void *token) = 0;
>>> >>>>> void wait_for_complete(void *token) = 0;
>>> >>>>> /// Get result of operation
>>> >>>>> int get_result(void *token) = 0;
>>> >>>>> /// Must only be called once is_opearation_complete(token)
>>> >>>>> void reset_operation_token(void *token) = 0;
>>> >>>>> /// Must only be called once is_opearation_complete(token)
>>> >>>>> void detroy_operation_token(void *token) = 0;
>>> >>>>>
>>> >>>>> /**
>>> >>>>>  * Queue a transaction
>>> >>>>>  *
>>> >>>>>  * token must be either fresh or reset since the last operation.
>>> >>>>>  * If the operation is completed syncronously, token can be resused
>>> >>>>>  * without calling reset_operation_token.
>>> >>>>>  *
>>> >>>>>  * @result 0 if completed syncronously, -EAGAIN if async
>>> >>>>>  */
>>> >>>>> int queue_transaction(
>>> >>>>>   Transaction *t,
>>> >>>>>   OpSequencer *osr,
>>> >>>>>   void *token
>>> >>>>>   ) = 0;
>>> >>>>>
>>> >>>>> /**
>>> >>>>>  * Queue a transaction
>>> >>>>>  *
>>> >>>>>  * token must be either fresh or reset since the last operation.
>>> >>>>>  * If the operation is completed syncronously, token can be resused
>>> >>>>>  * without calling reset_operation_token.
>>> >>>>>  *
>>> >>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>> >>>>>  */
>>> >>>>> int read(..., void *token) = 0;
>>> >>>>> ...
>>> >>>>>
>>> >>>>> The "token" concept here is opaque to allow the implementation some
>>> >>>>> flexibility.  Ideally, it would be nice to be able to include libaio
>>> >>>>> operation contexts directly.
>>> >>>>>
>>> >>>>> The main goal here is for the backend to have the freedom to complete
>>> >>>>> writes and reads asyncronously or syncronously as the sitation
>>> >>>>> warrants.
>>> >>>>> It also leaves the interface user in control of where the operation
>>> >>>>> completion is handled.  Each op thread can therefore handle its own
>>> >>>>> completions:
>>> >>>>>
>>> >>>>> struct InProgressOp {
>>> >>>>>   PGRef pg;
>>> >>>>>   ObjectStore::Token *token;
>>> >>>>>   OpContext *ctx;
>>> >>>>> };
>>> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>>> >>>>
>>> >>>> Probably a deque<> since we'll be pushign new requests and slurping off
>>> >>>> completed ones?  Or, we can make token not completely opaque, so that it
>>> >>>> includes a boost::intrusive::list node and can be strung on a
>>> >>>> user-managed
>>> >>>> queue.
>>> >>>>
>>> >>>>> for (auto op : in_progress) {
>>> >>>>>   op.token = objectstore->create_operation_token();
>>> >>>>> }
>>> >>>>>
>>> >>>>> uint64_t next_to_start = 0;
>>> >>>>> uint64_t next_to_complete = 0;
>>> >>>>>
>>> >>>>> while (1) {
>>> >>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>> >>>>>     objectstore->wait_for_complete(op.token);
>>> >>>>>   }
>>> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>> >>>>>     if (objectstore->is_operation_complete(op.token)) {
>>> >>>>>       PGRef pg = op.pg;
>>> >>>>>       OpContext *ctx = op.ctx;
>>> >>>>>       op.pg = PGRef();
>>> >>>>>       op.ctx = nullptr;
>>> >>>>>       objectstore->reset_operation_token(op.token);
>>> >>>>>       if (pg->continue_op(
>>> >>>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>> >>>>>               == -EAGAIN) {
>>> >>>>>         ++next_to_start;
>>> >>>>>         continue;
>>> >>>>>       }
>>> >>>>>     } else {
>>> >>>>>       break;
>>> >>>>>     }
>>> >>>>>   }
>>> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>> >>>>>   if (dq.second->do_op(
>>> >>>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>> >>>>>           == -EAGAIN) {
>>> >>>>>     ++next_to_start;
>>> >>>>>   }
>>> >>>>> }
>>> >>>>>
>>> >>>>> A design like this would allow the op thread to move onto another task
>>> >>>>> if the
>>> >>>>> objectstore implementation wants to perform an async operation.  For
>>> >>>>> this
>>> >>>>> to work, there is some work to be done:
>>> >>>>>
>>> >>>>> 1. All current reads in the read and write paths (probably including
>>> >>>>> the attr
>>> >>>>>    reads in get_object_context and friends) need to be able to handle
>>> >>>>>    getting
>>> >>>>>    -EAGAIN from the objectstore.
>>> >>>>
>>> >>>> Can we leave the old read methods in place as blocking versions, and
>>> >>>> have
>>> >>>> them block on the token before returning?  That'll make the transition
>>> >>>> less painful.
>>> >>>>
>>> >>>>> 2. Writes and reads need to be able to handle having the pg lock
>>> >>>>> dropped
>>> >>>>>    during the operation.  This should be ok since the actual object
>>> >>>>>    information
>>> >>>>>    is protected by the RWState locks.
>>> >>>>
>>> >>>> All of the async write pieces already handle this (recheck PG state
>>> >>>> after
>>> >>>> taking the lock).  If they don't get -EAGAIN they'd just call the next
>>> >>>> stage, probably with a flag indicating that validation can be skipped
>>> >>>> (since the lock hasn't been dropped)?
>>> >>>>
>>> >>>>> 3. OpContext needs to have enough information to pick up where the
>>> >>>>> operation
>>> >>>>>    left off.  This suggests that we should obtain all required
>>> >>>>>    ObjectContexts
>>> >>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
>>> >>>>
>>> >>>> Yeah...
>>> >>>>
>>> >>>>> 4. The object class interface will need to be replaced with a new
>>> >>>>> interface
>>> >>>>>    based on possibly async reads.  We can maintain compatibility with
>>> >>>>>    the
>>> >>>>>    current ones by launching a new thread to handle any message which
>>> >>>>>    happens
>>> >>>>>    to contain an old-style object class operation.
>>> >>>>
>>> >>>> Again, for now, wrappers would avoid this?
>>> >>>>
>>> >>>> s
>>> >>>>>
>>> >>>>> Most of this needs to happen to support object class operations on ec
>>> >>>>> pools
>>> >>>>> anyway.
>>> >>>>> -Sam
>>> >>>>> --
>>> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> >>>>> in
>>> >>>>> the body of a message to majordomo@vger.kernel.org
>>> >>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >>>>>
>>> >>>>>
>>> >>> --
>>> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> >>> the body of a message to majordomo@vger.kernel.org
>>> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> --
>>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> >> the body of a message to majordomo@vger.kernel.org
>>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >
>>> >
>>> >
>>> > --
>>> > Best Regards,
>>> >
>>> > Wheat
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> > the body of a message to majordomo@vger.kernel.org
>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> Milosz Tanski
>>> CTO
>>> 16 East 34th Street, 15th floor
>>> New York, NY 10016
>>>
>>> p: 646-253-9055
>>> e: milosz@adfin.com
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>
>
>
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Async reads, sync writes, op thread model discussion
  2015-08-27 23:21               ` Samuel Just
@ 2015-08-28 20:01                 ` Blinick, Stephen L
  2015-08-28 21:25                   ` Samuel Just
  0 siblings, 1 reply; 17+ messages in thread
From: Blinick, Stephen L @ 2015-08-28 20:01 UTC (permalink / raw)
  To: Samuel Just, Milosz Tanski
  Cc: Matt Benjamin, Haomai Wang, Yehuda Sadeh-Weinraub, Sage Weil,
	ceph-devel

This sounds ok, with the synchronous interface still possible to the ObjectStore based on return code.   

I'd think that the async read interface can be evaluated with any hardware, at least for correctness, by observing the queue depth to the device during a test run.  Also, I think asynchronous reads may benefit various types of NAND SSD's as they do better with more parallelism and I typically see very low queuedepth to them today with Filestore (one of the reasons I think doubling up OSD's on a single flash device helps benchmarks). 

Thanks,

Stephen



-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Samuel Just
Sent: Thursday, August 27, 2015 4:22 PM
To: Milosz Tanski
Cc: Matt Benjamin; Haomai Wang; Yehuda Sadeh-Weinraub; Sage Weil; ceph-devel
Subject: Re: Async reads, sync writes, op thread model discussion

It's been a couple of weeks, so I thought I'd send out a short progress update.  I've started by trying to nail down enough of the threading design/async interface to start refactoring do_op.  For the moment, I've backtracked on the token approach mostly because it seemed more complicated than necessary.  I'm thinking we'll keep a callback like mechanism, but move responsibility for queuing and execution back to the interface user by allowing the user to pass a completion queue and an uninterpreted completion pointer.  These two commits have the gist of the direction I'm going in (the actual code is more a place holder today).  An OSDReactor instance will replace each of the "shards" in the current sharded work queue.  Any aio initiated by a pg operation from a reactor will pass that reactor's queue, ensuring that the completion winds up back in the same thread.
Writes would work pretty much the same way, but with two callbacks.

My plan is to flesh this out to the point where the OSD works again, and then refactor the osd write path to use this mechanism for basic rbd writes.  That should be enough to let us evaluate whether this is a good path forward for async writes.  Async reads may be a bit tricky to evaluate.  It seems like we'd need hardware that needs that kind of queue depth and an objectstore implementation which can exploit it.
I'll wire up filestore to do async reads optionally for testing purposes, but it's not clear to me that there will be cases where filestore would want to do an async read rather than a sync read.

https://github.com/athanatos/ceph/commit/642b7190d70a5970534b911f929e6e3885bf99c4
https://github.com/athanatos/ceph/commit/42bee815081a91abd003bf7170ef1270f23222f6
-Sam

On Fri, Aug 14, 2015 at 3:36 PM, Milosz Tanski <milosz@adfin.com> wrote:
> On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin <mbenjamin@redhat.com> wrote:
>> Hi,
>>
>> I tend to agree with your comments regarding swapcontext/fibers.  I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either.
>
> Not suggesting the libraries/frameworks. Just brining up promises as 
> an alternative technique to coroutines. Dealing with spaghetti 
> evented/callback code gets old after doing it for 10+ years. Then 
> throw in blocking IO.
>
> And FYI, the data flow promises go back in comp sci back to the 80s.
>
> Cheers,
> - Milosz
>
>>
>> I like the way I interpreted Sam's design to be going, and in particular, that it seems to allow for consistent handling of read, write transactions.  I also would like to see how Yehuda's system works before arguing generalities.
>>
>> My intuition is, since the goal is more deterministic performance in 
>> a short horizion, you
>>
>> a. need to prioritize transparency over novel abstractions b. need to 
>> build solid microbenchmarks that encapsulate small, then larger 
>> pieces of the work pipeline
>>
>> My .05.
>>
>> Matt
>>
>> --
>> Matt Benjamin
>> Red Hat, Inc.
>> 315 West Huron Street, Suite 140A
>> Ann Arbor, Michigan 48103
>>
>> http://www.redhat.com/en/technologies/storage
>>
>> tel.  734-761-4689
>> fax.  734-769-8938
>> cel.  734-216-5309
>>
>> ----- Original Message -----
>>> From: "Milosz Tanski" <milosz@adfin.com>
>>> To: "Haomai Wang" <haomaiwang@gmail.com>
>>> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@redhat.com>, "Samuel Just" 
>>> <sjust@redhat.com>, "Sage Weil" <sage@newdream.net>, 
>>> ceph-devel@vger.kernel.org
>>> Sent: Friday, August 14, 2015 4:56:26 PM
>>> Subject: Re: Async reads, sync writes, op thread model discussion
>>>
>>> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
>>> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub 
>>> > <ysadehwe@redhat.com> wrote:
>>> >> Already mentioned it on irc, adding to ceph-devel for the sake of 
>>> >> completeness. I did some infrastructure work for rgw and it seems 
>>> >> (at least to me) that it could at least be partially useful here.
>>> >> Basically it's an async execution framework that utilizes coroutines.
>>> >> It's comprised of aio notification manager that can also be tied 
>>> >> into coroutines execution. The coroutines themselves are 
>>> >> stackless, they are implemented as state machines, but using some 
>>> >> boost trickery to hide the details so they can be written very 
>>> >> similar to blocking methods. Coroutines can also execute other 
>>> >> coroutines and can be stacked, or can generate concurrent 
>>> >> execution. It's still somewhat in flux, but I think it's mostly 
>>> >> done and already useful at this point, so if there's anything you 
>>> >> could use it might be a good idea to avoid effort duplication.
>>> >>
>>> >
>>> > coroutines like qemu is cool. The only thing I afraid is the 
>>> > complicate of debug and it's really a big task :-(
>>> >
>>> > I agree with sage that this design is really a new implementation 
>>> > for objectstore so that it's harmful to existing objectstore impl. 
>>> > I also suffer the pain from sync read xattr, we may add a async 
>>> > read interface to solove this?
>>> >
>>> > For context switch thing, now we have at least 3 cs for one op at 
>>> > osd side. messenger -> op queue -> objectstore queue. I guess op 
>>> > queue -> objectstore is easier to kick off just as sam said. We 
>>> > can make write journal inline with queue_transaction, so the 
>>> > caller could directly handle the transaction right now.
>>>
>>> I would caution agains coroutines (fibers) esp. in a multi-threaded 
>>> environment. Posix has officially obsoleted the swapcontext family 
>>> of functions in 1003.1-2004 and removed it in 1003.1-2008. That's 
>>> because they were notoriously non portable, and buggy. And yes you 
>>> can use something like boost::context / boost::coroutine instead but 
>>> they also have platform limitations. These implementations tend to 
>>> abuse / turn of various platform scrutiny features (like the one for 
>>> setjmp/longjmp). And on top of that many platforms don't consider 
>>> alternative context so you end up with obscure bugs. I've debugged 
>>> my fair share of bugs in Mordor coroutines with C++ exceptions, and 
>>> errno variables (since errno is really a function on linux and it's 
>>> output a pointer to threads errno is marked pure) if your coroutine 
>>> migrates threads. And you need to migrate them because of blocking 
>>> and uneven processor/thread distribution.
>>>
>>> None of these are obstacles that can't be solved, but added together 
>>> they become a pretty long term liability. So I think long and hard 
>>> about it. Qemu doesn't have some of those issues because it's uses a 
>>> single thread and a much simpler C ABI that it deals with.
>>>
>>> An alternative to coroutines that goes a long way towards solving 
>>> the callback spaghetti problem is futures/promises. I'm not talking 
>>> of the very future model that exists in C++11 library but more along 
>>> the lines that exist in other languages (like what's being done in 
>>> Javascript today). There's a good implementation of it Folly (the 
>>> facebook c++11 library). They have a very nice piece of 
>>> documentation here to understand how they work and how they differ.
>>>
>>> That future model is very handy when dealing with the callback 
>>> control flow problem. You can chain a bunch of processing steps that 
>>> requires some async action, return a future and continue so on and so forth.
>>> Also, it makes handling complex error cases easy by giving you a way 
>>> to skip lots of processing steps strait to onError at the end of the 
>>> chain.
>>>
>>> Take a look at folly. Take a look at the expanded boost futures 
>>> (they call this future continuations:
>>> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization
>>> .html#thread.synchronization.futures.then
>>> ). Also, building a cut down future framework just for Ceph (or 
>>> reduced set folly one) might be another option.
>>>
>>> Just an alternative.
>>>
>>> >
>>> > Anyway, I think we need to do some changes for this field.
>>> >
>>> >> Yehuda
>>> >>
>>> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com> wrote:
>>> >>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all 
>>> >>> tied to the actual interface I presented so much as the notion 
>>> >>> that the next thing to do is restructure the OpWQ users as async 
>>> >>> state machines.
>>> >>> -Sam
>>> >>>
>>> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
>>> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>>> >>>>> Currently, there are some deficiencies in how the OSD maps ops 
>>> >>>>> onto
>>> >>>>> threads:
>>> >>>>>
>>> >>>>> 1. Reads are always syncronous limiting the queue depth seen 
>>> >>>>> from the device
>>> >>>>>    and therefore the possible parallelism.
>>> >>>>> 2. Writes are always asyncronous forcing even very fast writes 
>>> >>>>> to be completed
>>> >>>>>    in a seperate thread.
>>> >>>>> 3. do_op cannot surrender the thread/pg lock during an 
>>> >>>>> operation forcing reads
>>> >>>>>    required to continue the operation to be syncronous.
>>> >>>>>
>>> >>>>> For spinning disks, this is mostly ok since they don't benefit 
>>> >>>>> as much from large read queues, and writes (filestore with 
>>> >>>>> journal) are too slow for the thread switches to make a big 
>>> >>>>> difference.  For very fast flash, however, we want the 
>>> >>>>> flexibility to allow the backend to perform writes 
>>> >>>>> syncronously or asyncronously when it makes sense, and to 
>>> >>>>> maintain a larger number of outstanding reads than we have 
>>> >>>>> threads.  To that end, I suggest changing the ObjectStore 
>>> >>>>> interface to be somewhat polling based:
>>> >>>>>
>>> >>>>> /// Create new token
>>> >>>>> void *create_operation_token() = 0; bool 
>>> >>>>> is_operation_complete(void *token) = 0; bool 
>>> >>>>> is_operation_committed(void *token) = 0; bool 
>>> >>>>> is_operation_applied(void *token) = 0; void 
>>> >>>>> wait_for_committed(void *token) = 0; void 
>>> >>>>> wait_for_applied(void *token) = 0; void wait_for_complete(void 
>>> >>>>> *token) = 0; /// Get result of operation int get_result(void 
>>> >>>>> *token) = 0; /// Must only be called once 
>>> >>>>> is_opearation_complete(token) void reset_operation_token(void 
>>> >>>>> *token) = 0; /// Must only be called once 
>>> >>>>> is_opearation_complete(token) void detroy_operation_token(void 
>>> >>>>> *token) = 0;
>>> >>>>>
>>> >>>>> /**
>>> >>>>>  * Queue a transaction
>>> >>>>>  *
>>> >>>>>  * token must be either fresh or reset since the last operation.
>>> >>>>>  * If the operation is completed syncronously, token can be 
>>> >>>>> resused
>>> >>>>>  * without calling reset_operation_token.
>>> >>>>>  *
>>> >>>>>  * @result 0 if completed syncronously, -EAGAIN if async  */ 
>>> >>>>> int queue_transaction(
>>> >>>>>   Transaction *t,
>>> >>>>>   OpSequencer *osr,
>>> >>>>>   void *token
>>> >>>>>   ) = 0;
>>> >>>>>
>>> >>>>> /**
>>> >>>>>  * Queue a transaction
>>> >>>>>  *
>>> >>>>>  * token must be either fresh or reset since the last operation.
>>> >>>>>  * If the operation is completed syncronously, token can be 
>>> >>>>> resused
>>> >>>>>  * without calling reset_operation_token.
>>> >>>>>  *
>>> >>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>> >>>>>  */
>>> >>>>> int read(..., void *token) = 0; ...
>>> >>>>>
>>> >>>>> The "token" concept here is opaque to allow the implementation 
>>> >>>>> some flexibility.  Ideally, it would be nice to be able to 
>>> >>>>> include libaio operation contexts directly.
>>> >>>>>
>>> >>>>> The main goal here is for the backend to have the freedom to 
>>> >>>>> complete writes and reads asyncronously or syncronously as the 
>>> >>>>> sitation warrants.
>>> >>>>> It also leaves the interface user in control of where the 
>>> >>>>> operation completion is handled.  Each op thread can therefore 
>>> >>>>> handle its own
>>> >>>>> completions:
>>> >>>>>
>>> >>>>> struct InProgressOp {
>>> >>>>>   PGRef pg;
>>> >>>>>   ObjectStore::Token *token;
>>> >>>>>   OpContext *ctx;
>>> >>>>> };
>>> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>>> >>>>
>>> >>>> Probably a deque<> since we'll be pushign new requests and 
>>> >>>> slurping off completed ones?  Or, we can make token not 
>>> >>>> completely opaque, so that it includes a boost::intrusive::list 
>>> >>>> node and can be strung on a user-managed queue.
>>> >>>>
>>> >>>>> for (auto op : in_progress) {
>>> >>>>>   op.token = objectstore->create_operation_token();
>>> >>>>> }
>>> >>>>>
>>> >>>>> uint64_t next_to_start = 0;
>>> >>>>> uint64_t next_to_complete = 0;
>>> >>>>>
>>> >>>>> while (1) {
>>> >>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>> >>>>>     objectstore->wait_for_complete(op.token);
>>> >>>>>   }
>>> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>> >>>>>     if (objectstore->is_operation_complete(op.token)) {
>>> >>>>>       PGRef pg = op.pg;
>>> >>>>>       OpContext *ctx = op.ctx;
>>> >>>>>       op.pg = PGRef();
>>> >>>>>       op.ctx = nullptr;
>>> >>>>>       objectstore->reset_operation_token(op.token);
>>> >>>>>       if (pg->continue_op(
>>> >>>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>> >>>>>               == -EAGAIN) {
>>> >>>>>         ++next_to_start;
>>> >>>>>         continue;
>>> >>>>>       }
>>> >>>>>     } else {
>>> >>>>>       break;
>>> >>>>>     }
>>> >>>>>   }
>>> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>> >>>>>   if (dq.second->do_op(
>>> >>>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>> >>>>>           == -EAGAIN) {
>>> >>>>>     ++next_to_start;
>>> >>>>>   }
>>> >>>>> }
>>> >>>>>
>>> >>>>> A design like this would allow the op thread to move onto 
>>> >>>>> another task if the objectstore implementation wants to 
>>> >>>>> perform an async operation.  For this to work, there is some 
>>> >>>>> work to be done:
>>> >>>>>
>>> >>>>> 1. All current reads in the read and write paths (probably 
>>> >>>>> including the attr
>>> >>>>>    reads in get_object_context and friends) need to be able to handle
>>> >>>>>    getting
>>> >>>>>    -EAGAIN from the objectstore.
>>> >>>>
>>> >>>> Can we leave the old read methods in place as blocking 
>>> >>>> versions, and have them block on the token before returning?  
>>> >>>> That'll make the transition less painful.
>>> >>>>
>>> >>>>> 2. Writes and reads need to be able to handle having the pg 
>>> >>>>> lock dropped
>>> >>>>>    during the operation.  This should be ok since the actual object
>>> >>>>>    information
>>> >>>>>    is protected by the RWState locks.
>>> >>>>
>>> >>>> All of the async write pieces already handle this (recheck PG 
>>> >>>> state after taking the lock).  If they don't get -EAGAIN they'd 
>>> >>>> just call the next stage, probably with a flag indicating that 
>>> >>>> validation can be skipped (since the lock hasn't been dropped)?
>>> >>>>
>>> >>>>> 3. OpContext needs to have enough information to pick up where 
>>> >>>>> the operation
>>> >>>>>    left off.  This suggests that we should obtain all required
>>> >>>>>    ObjectContexts
>>> >>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
>>> >>>>
>>> >>>> Yeah...
>>> >>>>
>>> >>>>> 4. The object class interface will need to be replaced with a 
>>> >>>>> new interface
>>> >>>>>    based on possibly async reads.  We can maintain compatibility with
>>> >>>>>    the
>>> >>>>>    current ones by launching a new thread to handle any message which
>>> >>>>>    happens
>>> >>>>>    to contain an old-style object class operation.
>>> >>>>
>>> >>>> Again, for now, wrappers would avoid this?
>>> >>>>
>>> >>>> s
>>> >>>>>
>>> >>>>> Most of this needs to happen to support object class 
>>> >>>>> operations on ec pools anyway.
>>> >>>>> -Sam
>>> >>>>> --
>>> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> >>>>> in
>>> >>>>> the body of a message to majordomo@vger.kernel.org More 
>>> >>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >>>>>
>>> >>>>>
>>> >>> --
>>> >>> To unsubscribe from this list: send the line "unsubscribe 
>>> >>> ceph-devel" in the body of a message to 
>>> >>> majordomo@vger.kernel.org More majordomo info at  
>>> >>> http://vger.kernel.org/majordomo-info.html
>>> >> --
>>> >> To unsubscribe from this list: send the line "unsubscribe 
>>> >> ceph-devel" in the body of a message to majordomo@vger.kernel.org 
>>> >> More majordomo info at  
>>> >> http://vger.kernel.org/majordomo-info.html
>>> >
>>> >
>>> >
>>> > --
>>> > Best Regards,
>>> >
>>> > Wheat
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe 
>>> > ceph-devel" in the body of a message to majordomo@vger.kernel.org 
>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> Milosz Tanski
>>> CTO
>>> 16 East 34th Street, 15th floor
>>> New York, NY 10016
>>>
>>> p: 646-253-9055
>>> e: milosz@adfin.com
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe 
>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org 
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>
>
>
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: milosz@adfin.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
  2015-08-28 20:01                 ` Blinick, Stephen L
@ 2015-08-28 21:25                   ` Samuel Just
  2015-10-28 20:15                     ` Samuel Just
  0 siblings, 1 reply; 17+ messages in thread
From: Samuel Just @ 2015-08-28 21:25 UTC (permalink / raw)
  To: Blinick, Stephen L
  Cc: Milosz Tanski, Matt Benjamin, Haomai Wang, Yehuda Sadeh-Weinraub,
	Sage Weil, ceph-devel

Oh, yeah, we'll definitely test for correctness for async reads on
filestore, I'm just worried about validating the performance
assumptions.  The 3700s might be just fine for that validation though.
-Sam

On Fri, Aug 28, 2015 at 1:01 PM, Blinick, Stephen L
<stephen.l.blinick@intel.com> wrote:
> This sounds ok, with the synchronous interface still possible to the ObjectStore based on return code.
>
> I'd think that the async read interface can be evaluated with any hardware, at least for correctness, by observing the queue depth to the device during a test run.  Also, I think asynchronous reads may benefit various types of NAND SSD's as they do better with more parallelism and I typically see very low queuedepth to them today with Filestore (one of the reasons I think doubling up OSD's on a single flash device helps benchmarks).
>
> Thanks,
>
> Stephen
>
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Samuel Just
> Sent: Thursday, August 27, 2015 4:22 PM
> To: Milosz Tanski
> Cc: Matt Benjamin; Haomai Wang; Yehuda Sadeh-Weinraub; Sage Weil; ceph-devel
> Subject: Re: Async reads, sync writes, op thread model discussion
>
> It's been a couple of weeks, so I thought I'd send out a short progress update.  I've started by trying to nail down enough of the threading design/async interface to start refactoring do_op.  For the moment, I've backtracked on the token approach mostly because it seemed more complicated than necessary.  I'm thinking we'll keep a callback like mechanism, but move responsibility for queuing and execution back to the interface user by allowing the user to pass a completion queue and an uninterpreted completion pointer.  These two commits have the gist of the direction I'm going in (the actual code is more a place holder today).  An OSDReactor instance will replace each of the "shards" in the current sharded work queue.  Any aio initiated by a pg operation from a reactor will pass that rea
 ctor's queue, ensuring that the completion winds up back in the same thread.
> Writes would work pretty much the same way, but with two callbacks.
>
> My plan is to flesh this out to the point where the OSD works again, and then refactor the osd write path to use this mechanism for basic rbd writes.  That should be enough to let us evaluate whether this is a good path forward for async writes.  Async reads may be a bit tricky to evaluate.  It seems like we'd need hardware that needs that kind of queue depth and an objectstore implementation which can exploit it.
> I'll wire up filestore to do async reads optionally for testing purposes, but it's not clear to me that there will be cases where filestore would want to do an async read rather than a sync read.
>
> https://github.com/athanatos/ceph/commit/642b7190d70a5970534b911f929e6e3885bf99c4
> https://github.com/athanatos/ceph/commit/42bee815081a91abd003bf7170ef1270f23222f6
> -Sam
>
> On Fri, Aug 14, 2015 at 3:36 PM, Milosz Tanski <milosz@adfin.com> wrote:
>> On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin <mbenjamin@redhat.com> wrote:
>>> Hi,
>>>
>>> I tend to agree with your comments regarding swapcontext/fibers.  I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either.
>>
>> Not suggesting the libraries/frameworks. Just brining up promises as
>> an alternative technique to coroutines. Dealing with spaghetti
>> evented/callback code gets old after doing it for 10+ years. Then
>> throw in blocking IO.
>>
>> And FYI, the data flow promises go back in comp sci back to the 80s.
>>
>> Cheers,
>> - Milosz
>>
>>>
>>> I like the way I interpreted Sam's design to be going, and in particular, that it seems to allow for consistent handling of read, write transactions.  I also would like to see how Yehuda's system works before arguing generalities.
>>>
>>> My intuition is, since the goal is more deterministic performance in
>>> a short horizion, you
>>>
>>> a. need to prioritize transparency over novel abstractions b. need to
>>> build solid microbenchmarks that encapsulate small, then larger
>>> pieces of the work pipeline
>>>
>>> My .05.
>>>
>>> Matt
>>>
>>> --
>>> Matt Benjamin
>>> Red Hat, Inc.
>>> 315 West Huron Street, Suite 140A
>>> Ann Arbor, Michigan 48103
>>>
>>> http://www.redhat.com/en/technologies/storage
>>>
>>> tel.  734-761-4689
>>> fax.  734-769-8938
>>> cel.  734-216-5309
>>>
>>> ----- Original Message -----
>>>> From: "Milosz Tanski" <milosz@adfin.com>
>>>> To: "Haomai Wang" <haomaiwang@gmail.com>
>>>> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@redhat.com>, "Samuel Just"
>>>> <sjust@redhat.com>, "Sage Weil" <sage@newdream.net>,
>>>> ceph-devel@vger.kernel.org
>>>> Sent: Friday, August 14, 2015 4:56:26 PM
>>>> Subject: Re: Async reads, sync writes, op thread model discussion
>>>>
>>>> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
>>>> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
>>>> > <ysadehwe@redhat.com> wrote:
>>>> >> Already mentioned it on irc, adding to ceph-devel for the sake of
>>>> >> completeness. I did some infrastructure work for rgw and it seems
>>>> >> (at least to me) that it could at least be partially useful here.
>>>> >> Basically it's an async execution framework that utilizes coroutines.
>>>> >> It's comprised of aio notification manager that can also be tied
>>>> >> into coroutines execution. The coroutines themselves are
>>>> >> stackless, they are implemented as state machines, but using some
>>>> >> boost trickery to hide the details so they can be written very
>>>> >> similar to blocking methods. Coroutines can also execute other
>>>> >> coroutines and can be stacked, or can generate concurrent
>>>> >> execution. It's still somewhat in flux, but I think it's mostly
>>>> >> done and already useful at this point, so if there's anything you
>>>> >> could use it might be a good idea to avoid effort duplication.
>>>> >>
>>>> >
>>>> > coroutines like qemu is cool. The only thing I afraid is the
>>>> > complicate of debug and it's really a big task :-(
>>>> >
>>>> > I agree with sage that this design is really a new implementation
>>>> > for objectstore so that it's harmful to existing objectstore impl.
>>>> > I also suffer the pain from sync read xattr, we may add a async
>>>> > read interface to solove this?
>>>> >
>>>> > For context switch thing, now we have at least 3 cs for one op at
>>>> > osd side. messenger -> op queue -> objectstore queue. I guess op
>>>> > queue -> objectstore is easier to kick off just as sam said. We
>>>> > can make write journal inline with queue_transaction, so the
>>>> > caller could directly handle the transaction right now.
>>>>
>>>> I would caution agains coroutines (fibers) esp. in a multi-threaded
>>>> environment. Posix has officially obsoleted the swapcontext family
>>>> of functions in 1003.1-2004 and removed it in 1003.1-2008. That's
>>>> because they were notoriously non portable, and buggy. And yes you
>>>> can use something like boost::context / boost::coroutine instead but
>>>> they also have platform limitations. These implementations tend to
>>>> abuse / turn of various platform scrutiny features (like the one for
>>>> setjmp/longjmp). And on top of that many platforms don't consider
>>>> alternative context so you end up with obscure bugs. I've debugged
>>>> my fair share of bugs in Mordor coroutines with C++ exceptions, and
>>>> errno variables (since errno is really a function on linux and it's
>>>> output a pointer to threads errno is marked pure) if your coroutine
>>>> migrates threads. And you need to migrate them because of blocking
>>>> and uneven processor/thread distribution.
>>>>
>>>> None of these are obstacles that can't be solved, but added together
>>>> they become a pretty long term liability. So I think long and hard
>>>> about it. Qemu doesn't have some of those issues because it's uses a
>>>> single thread and a much simpler C ABI that it deals with.
>>>>
>>>> An alternative to coroutines that goes a long way towards solving
>>>> the callback spaghetti problem is futures/promises. I'm not talking
>>>> of the very future model that exists in C++11 library but more along
>>>> the lines that exist in other languages (like what's being done in
>>>> Javascript today). There's a good implementation of it Folly (the
>>>> facebook c++11 library). They have a very nice piece of
>>>> documentation here to understand how they work and how they differ.
>>>>
>>>> That future model is very handy when dealing with the callback
>>>> control flow problem. You can chain a bunch of processing steps that
>>>> requires some async action, return a future and continue so on and so forth.
>>>> Also, it makes handling complex error cases easy by giving you a way
>>>> to skip lots of processing steps strait to onError at the end of the
>>>> chain.
>>>>
>>>> Take a look at folly. Take a look at the expanded boost futures
>>>> (they call this future continuations:
>>>> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization
>>>> .html#thread.synchronization.futures.then
>>>> ). Also, building a cut down future framework just for Ceph (or
>>>> reduced set folly one) might be another option.
>>>>
>>>> Just an alternative.
>>>>
>>>> >
>>>> > Anyway, I think we need to do some changes for this field.
>>>> >
>>>> >> Yehuda
>>>> >>
>>>> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com> wrote:
>>>> >>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all
>>>> >>> tied to the actual interface I presented so much as the notion
>>>> >>> that the next thing to do is restructure the OpWQ users as async
>>>> >>> state machines.
>>>> >>> -Sam
>>>> >>>
>>>> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
>>>> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>>>> >>>>> Currently, there are some deficiencies in how the OSD maps ops
>>>> >>>>> onto
>>>> >>>>> threads:
>>>> >>>>>
>>>> >>>>> 1. Reads are always syncronous limiting the queue depth seen
>>>> >>>>> from the device
>>>> >>>>>    and therefore the possible parallelism.
>>>> >>>>> 2. Writes are always asyncronous forcing even very fast writes
>>>> >>>>> to be completed
>>>> >>>>>    in a seperate thread.
>>>> >>>>> 3. do_op cannot surrender the thread/pg lock during an
>>>> >>>>> operation forcing reads
>>>> >>>>>    required to continue the operation to be syncronous.
>>>> >>>>>
>>>> >>>>> For spinning disks, this is mostly ok since they don't benefit
>>>> >>>>> as much from large read queues, and writes (filestore with
>>>> >>>>> journal) are too slow for the thread switches to make a big
>>>> >>>>> difference.  For very fast flash, however, we want the
>>>> >>>>> flexibility to allow the backend to perform writes
>>>> >>>>> syncronously or asyncronously when it makes sense, and to
>>>> >>>>> maintain a larger number of outstanding reads than we have
>>>> >>>>> threads.  To that end, I suggest changing the ObjectStore
>>>> >>>>> interface to be somewhat polling based:
>>>> >>>>>
>>>> >>>>> /// Create new token
>>>> >>>>> void *create_operation_token() = 0; bool
>>>> >>>>> is_operation_complete(void *token) = 0; bool
>>>> >>>>> is_operation_committed(void *token) = 0; bool
>>>> >>>>> is_operation_applied(void *token) = 0; void
>>>> >>>>> wait_for_committed(void *token) = 0; void
>>>> >>>>> wait_for_applied(void *token) = 0; void wait_for_complete(void
>>>> >>>>> *token) = 0; /// Get result of operation int get_result(void
>>>> >>>>> *token) = 0; /// Must only be called once
>>>> >>>>> is_opearation_complete(token) void reset_operation_token(void
>>>> >>>>> *token) = 0; /// Must only be called once
>>>> >>>>> is_opearation_complete(token) void detroy_operation_token(void
>>>> >>>>> *token) = 0;
>>>> >>>>>
>>>> >>>>> /**
>>>> >>>>>  * Queue a transaction
>>>> >>>>>  *
>>>> >>>>>  * token must be either fresh or reset since the last operation.
>>>> >>>>>  * If the operation is completed syncronously, token can be
>>>> >>>>> resused
>>>> >>>>>  * without calling reset_operation_token.
>>>> >>>>>  *
>>>> >>>>>  * @result 0 if completed syncronously, -EAGAIN if async  */
>>>> >>>>> int queue_transaction(
>>>> >>>>>   Transaction *t,
>>>> >>>>>   OpSequencer *osr,
>>>> >>>>>   void *token
>>>> >>>>>   ) = 0;
>>>> >>>>>
>>>> >>>>> /**
>>>> >>>>>  * Queue a transaction
>>>> >>>>>  *
>>>> >>>>>  * token must be either fresh or reset since the last operation.
>>>> >>>>>  * If the operation is completed syncronously, token can be
>>>> >>>>> resused
>>>> >>>>>  * without calling reset_operation_token.
>>>> >>>>>  *
>>>> >>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>>> >>>>>  */
>>>> >>>>> int read(..., void *token) = 0; ...
>>>> >>>>>
>>>> >>>>> The "token" concept here is opaque to allow the implementation
>>>> >>>>> some flexibility.  Ideally, it would be nice to be able to
>>>> >>>>> include libaio operation contexts directly.
>>>> >>>>>
>>>> >>>>> The main goal here is for the backend to have the freedom to
>>>> >>>>> complete writes and reads asyncronously or syncronously as the
>>>> >>>>> sitation warrants.
>>>> >>>>> It also leaves the interface user in control of where the
>>>> >>>>> operation completion is handled.  Each op thread can therefore
>>>> >>>>> handle its own
>>>> >>>>> completions:
>>>> >>>>>
>>>> >>>>> struct InProgressOp {
>>>> >>>>>   PGRef pg;
>>>> >>>>>   ObjectStore::Token *token;
>>>> >>>>>   OpContext *ctx;
>>>> >>>>> };
>>>> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>>>> >>>>
>>>> >>>> Probably a deque<> since we'll be pushign new requests and
>>>> >>>> slurping off completed ones?  Or, we can make token not
>>>> >>>> completely opaque, so that it includes a boost::intrusive::list
>>>> >>>> node and can be strung on a user-managed queue.
>>>> >>>>
>>>> >>>>> for (auto op : in_progress) {
>>>> >>>>>   op.token = objectstore->create_operation_token();
>>>> >>>>> }
>>>> >>>>>
>>>> >>>>> uint64_t next_to_start = 0;
>>>> >>>>> uint64_t next_to_complete = 0;
>>>> >>>>>
>>>> >>>>> while (1) {
>>>> >>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>>>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>> >>>>>     objectstore->wait_for_complete(op.token);
>>>> >>>>>   }
>>>> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>> >>>>>     if (objectstore->is_operation_complete(op.token)) {
>>>> >>>>>       PGRef pg = op.pg;
>>>> >>>>>       OpContext *ctx = op.ctx;
>>>> >>>>>       op.pg = PGRef();
>>>> >>>>>       op.ctx = nullptr;
>>>> >>>>>       objectstore->reset_operation_token(op.token);
>>>> >>>>>       if (pg->continue_op(
>>>> >>>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>> >>>>>               == -EAGAIN) {
>>>> >>>>>         ++next_to_start;
>>>> >>>>>         continue;
>>>> >>>>>       }
>>>> >>>>>     } else {
>>>> >>>>>       break;
>>>> >>>>>     }
>>>> >>>>>   }
>>>> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>>> >>>>>   if (dq.second->do_op(
>>>> >>>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>> >>>>>           == -EAGAIN) {
>>>> >>>>>     ++next_to_start;
>>>> >>>>>   }
>>>> >>>>> }
>>>> >>>>>
>>>> >>>>> A design like this would allow the op thread to move onto
>>>> >>>>> another task if the objectstore implementation wants to
>>>> >>>>> perform an async operation.  For this to work, there is some
>>>> >>>>> work to be done:
>>>> >>>>>
>>>> >>>>> 1. All current reads in the read and write paths (probably
>>>> >>>>> including the attr
>>>> >>>>>    reads in get_object_context and friends) need to be able to handle
>>>> >>>>>    getting
>>>> >>>>>    -EAGAIN from the objectstore.
>>>> >>>>
>>>> >>>> Can we leave the old read methods in place as blocking
>>>> >>>> versions, and have them block on the token before returning?
>>>> >>>> That'll make the transition less painful.
>>>> >>>>
>>>> >>>>> 2. Writes and reads need to be able to handle having the pg
>>>> >>>>> lock dropped
>>>> >>>>>    during the operation.  This should be ok since the actual object
>>>> >>>>>    information
>>>> >>>>>    is protected by the RWState locks.
>>>> >>>>
>>>> >>>> All of the async write pieces already handle this (recheck PG
>>>> >>>> state after taking the lock).  If they don't get -EAGAIN they'd
>>>> >>>> just call the next stage, probably with a flag indicating that
>>>> >>>> validation can be skipped (since the lock hasn't been dropped)?
>>>> >>>>
>>>> >>>>> 3. OpContext needs to have enough information to pick up where
>>>> >>>>> the operation
>>>> >>>>>    left off.  This suggests that we should obtain all required
>>>> >>>>>    ObjectContexts
>>>> >>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
>>>> >>>>
>>>> >>>> Yeah...
>>>> >>>>
>>>> >>>>> 4. The object class interface will need to be replaced with a
>>>> >>>>> new interface
>>>> >>>>>    based on possibly async reads.  We can maintain compatibility with
>>>> >>>>>    the
>>>> >>>>>    current ones by launching a new thread to handle any message which
>>>> >>>>>    happens
>>>> >>>>>    to contain an old-style object class operation.
>>>> >>>>
>>>> >>>> Again, for now, wrappers would avoid this?
>>>> >>>>
>>>> >>>> s
>>>> >>>>>
>>>> >>>>> Most of this needs to happen to support object class
>>>> >>>>> operations on ec pools anyway.
>>>> >>>>> -Sam
>>>> >>>>> --
>>>> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> >>>>> in
>>>> >>>>> the body of a message to majordomo@vger.kernel.org More
>>>> >>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> >>>>>
>>>> >>>>>
>>>> >>> --
>>>> >>> To unsubscribe from this list: send the line "unsubscribe
>>>> >>> ceph-devel" in the body of a message to
>>>> >>> majordomo@vger.kernel.org More majordomo info at
>>>> >>> http://vger.kernel.org/majordomo-info.html
>>>> >> --
>>>> >> To unsubscribe from this list: send the line "unsubscribe
>>>> >> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>> >> More majordomo info at
>>>> >> http://vger.kernel.org/majordomo-info.html
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Best Regards,
>>>> >
>>>> > Wheat
>>>> > --
>>>> > To unsubscribe from this list: send the line "unsubscribe
>>>> > ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>>
>>>> --
>>>> Milosz Tanski
>>>> CTO
>>>> 16 East 34th Street, 15th floor
>>>> New York, NY 10016
>>>>
>>>> p: 646-253-9055
>>>> e: milosz@adfin.com
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>
>>
>>
>> --
>> Milosz Tanski
>> CTO
>> 16 East 34th Street, 15th floor
>> New York, NY 10016
>>
>> p: 646-253-9055
>> e: milosz@adfin.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
  2015-08-28 21:25                   ` Samuel Just
@ 2015-10-28 20:15                     ` Samuel Just
  2015-10-30 16:20                       ` Jason Dillaman
  0 siblings, 1 reply; 17+ messages in thread
From: Samuel Just @ 2015-10-28 20:15 UTC (permalink / raw)
  To: ceph-devel
  Cc: Milosz Tanski, Matt Benjamin, Haomai Wang, Yehuda Sadeh-Weinraub,
	Sage Weil, Adam Emerson, Blinick, Stephen L

Sorry for the long delay since my last update, been working mostly on
infernalis stabilization :)

https://github.com/athanatos/ceph/tree/wip-do-op-2015-10-28

has my current branch.

I've changed direction a bit, a future based interface seems like it
will provide the most tolerable interface for sequencing operations on
possibly async objectstore results.  I've got an early version of a
common/Future.h fleshed out (very different goals from the C++ futures
library, I'm afraid -- based heavily on the seastar library) and a
commit which converts get_object_context to use the interface.

One of the main questions for me is whether we can get the overhead of
the Future interface and operation sequencing low enough to be
tolerable for fast synchronous backends.  I'm still working on trying
to get at least a simple read path converted to use async objectstore
operations so we can test the overhead and the performance improvement
for an async backend.
-Sam

On Fri, Aug 28, 2015 at 2:25 PM, Samuel Just <sjust@redhat.com> wrote:
> Oh, yeah, we'll definitely test for correctness for async reads on
> filestore, I'm just worried about validating the performance
> assumptions.  The 3700s might be just fine for that validation though.
> -Sam
>
> On Fri, Aug 28, 2015 at 1:01 PM, Blinick, Stephen L
> <stephen.l.blinick@intel.com> wrote:
>> This sounds ok, with the synchronous interface still possible to the ObjectStore based on return code.
>>
>> I'd think that the async read interface can be evaluated with any hardware, at least for correctness, by observing the queue depth to the device during a test run.  Also, I think asynchronous reads may benefit various types of NAND SSD's as they do better with more parallelism and I typically see very low queuedepth to them today with Filestore (one of the reasons I think doubling up OSD's on a single flash device helps benchmarks).
>>
>> Thanks,
>>
>> Stephen
>>
>>
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Samuel Just
>> Sent: Thursday, August 27, 2015 4:22 PM
>> To: Milosz Tanski
>> Cc: Matt Benjamin; Haomai Wang; Yehuda Sadeh-Weinraub; Sage Weil; ceph-devel
>> Subject: Re: Async reads, sync writes, op thread model discussion
>>
>> It's been a couple of weeks, so I thought I'd send out a short progress update.  I've started by trying to nail down enough of the threading design/async interface to start refactoring do_op.  For the moment, I've backtracked on the token approach mostly because it seemed more complicated than necessary.  I'm thinking we'll keep a callback like mechanism, but move responsibility for queuing and execution back to the interface user by allowing the user to pass a completion queue and an uninterpreted completion pointer.  These two commits have the gist of the direction I'm going in (the actual code is more a place holder today).  An OSDReactor instance will replace each of the "shards" in the current sharded work queue.  Any aio initiated by a pg operation from a reactor will pass that re
 actor's queue, ensuring that the completion winds up back in the same thread.
>> Writes would work pretty much the same way, but with two callbacks.
>>
>> My plan is to flesh this out to the point where the OSD works again, and then refactor the osd write path to use this mechanism for basic rbd writes.  That should be enough to let us evaluate whether this is a good path forward for async writes.  Async reads may be a bit tricky to evaluate.  It seems like we'd need hardware that needs that kind of queue depth and an objectstore implementation which can exploit it.
>> I'll wire up filestore to do async reads optionally for testing purposes, but it's not clear to me that there will be cases where filestore would want to do an async read rather than a sync read.
>>
>> https://github.com/athanatos/ceph/commit/642b7190d70a5970534b911f929e6e3885bf99c4
>> https://github.com/athanatos/ceph/commit/42bee815081a91abd003bf7170ef1270f23222f6
>> -Sam
>>
>> On Fri, Aug 14, 2015 at 3:36 PM, Milosz Tanski <milosz@adfin.com> wrote:
>>> On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin <mbenjamin@redhat.com> wrote:
>>>> Hi,
>>>>
>>>> I tend to agree with your comments regarding swapcontext/fibers.  I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either.
>>>
>>> Not suggesting the libraries/frameworks. Just brining up promises as
>>> an alternative technique to coroutines. Dealing with spaghetti
>>> evented/callback code gets old after doing it for 10+ years. Then
>>> throw in blocking IO.
>>>
>>> And FYI, the data flow promises go back in comp sci back to the 80s.
>>>
>>> Cheers,
>>> - Milosz
>>>
>>>>
>>>> I like the way I interpreted Sam's design to be going, and in particular, that it seems to allow for consistent handling of read, write transactions.  I also would like to see how Yehuda's system works before arguing generalities.
>>>>
>>>> My intuition is, since the goal is more deterministic performance in
>>>> a short horizion, you
>>>>
>>>> a. need to prioritize transparency over novel abstractions b. need to
>>>> build solid microbenchmarks that encapsulate small, then larger
>>>> pieces of the work pipeline
>>>>
>>>> My .05.
>>>>
>>>> Matt
>>>>
>>>> --
>>>> Matt Benjamin
>>>> Red Hat, Inc.
>>>> 315 West Huron Street, Suite 140A
>>>> Ann Arbor, Michigan 48103
>>>>
>>>> http://www.redhat.com/en/technologies/storage
>>>>
>>>> tel.  734-761-4689
>>>> fax.  734-769-8938
>>>> cel.  734-216-5309
>>>>
>>>> ----- Original Message -----
>>>>> From: "Milosz Tanski" <milosz@adfin.com>
>>>>> To: "Haomai Wang" <haomaiwang@gmail.com>
>>>>> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@redhat.com>, "Samuel Just"
>>>>> <sjust@redhat.com>, "Sage Weil" <sage@newdream.net>,
>>>>> ceph-devel@vger.kernel.org
>>>>> Sent: Friday, August 14, 2015 4:56:26 PM
>>>>> Subject: Re: Async reads, sync writes, op thread model discussion
>>>>>
>>>>> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
>>>>> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
>>>>> > <ysadehwe@redhat.com> wrote:
>>>>> >> Already mentioned it on irc, adding to ceph-devel for the sake of
>>>>> >> completeness. I did some infrastructure work for rgw and it seems
>>>>> >> (at least to me) that it could at least be partially useful here.
>>>>> >> Basically it's an async execution framework that utilizes coroutines.
>>>>> >> It's comprised of aio notification manager that can also be tied
>>>>> >> into coroutines execution. The coroutines themselves are
>>>>> >> stackless, they are implemented as state machines, but using some
>>>>> >> boost trickery to hide the details so they can be written very
>>>>> >> similar to blocking methods. Coroutines can also execute other
>>>>> >> coroutines and can be stacked, or can generate concurrent
>>>>> >> execution. It's still somewhat in flux, but I think it's mostly
>>>>> >> done and already useful at this point, so if there's anything you
>>>>> >> could use it might be a good idea to avoid effort duplication.
>>>>> >>
>>>>> >
>>>>> > coroutines like qemu is cool. The only thing I afraid is the
>>>>> > complicate of debug and it's really a big task :-(
>>>>> >
>>>>> > I agree with sage that this design is really a new implementation
>>>>> > for objectstore so that it's harmful to existing objectstore impl.
>>>>> > I also suffer the pain from sync read xattr, we may add a async
>>>>> > read interface to solove this?
>>>>> >
>>>>> > For context switch thing, now we have at least 3 cs for one op at
>>>>> > osd side. messenger -> op queue -> objectstore queue. I guess op
>>>>> > queue -> objectstore is easier to kick off just as sam said. We
>>>>> > can make write journal inline with queue_transaction, so the
>>>>> > caller could directly handle the transaction right now.
>>>>>
>>>>> I would caution agains coroutines (fibers) esp. in a multi-threaded
>>>>> environment. Posix has officially obsoleted the swapcontext family
>>>>> of functions in 1003.1-2004 and removed it in 1003.1-2008. That's
>>>>> because they were notoriously non portable, and buggy. And yes you
>>>>> can use something like boost::context / boost::coroutine instead but
>>>>> they also have platform limitations. These implementations tend to
>>>>> abuse / turn of various platform scrutiny features (like the one for
>>>>> setjmp/longjmp). And on top of that many platforms don't consider
>>>>> alternative context so you end up with obscure bugs. I've debugged
>>>>> my fair share of bugs in Mordor coroutines with C++ exceptions, and
>>>>> errno variables (since errno is really a function on linux and it's
>>>>> output a pointer to threads errno is marked pure) if your coroutine
>>>>> migrates threads. And you need to migrate them because of blocking
>>>>> and uneven processor/thread distribution.
>>>>>
>>>>> None of these are obstacles that can't be solved, but added together
>>>>> they become a pretty long term liability. So I think long and hard
>>>>> about it. Qemu doesn't have some of those issues because it's uses a
>>>>> single thread and a much simpler C ABI that it deals with.
>>>>>
>>>>> An alternative to coroutines that goes a long way towards solving
>>>>> the callback spaghetti problem is futures/promises. I'm not talking
>>>>> of the very future model that exists in C++11 library but more along
>>>>> the lines that exist in other languages (like what's being done in
>>>>> Javascript today). There's a good implementation of it Folly (the
>>>>> facebook c++11 library). They have a very nice piece of
>>>>> documentation here to understand how they work and how they differ.
>>>>>
>>>>> That future model is very handy when dealing with the callback
>>>>> control flow problem. You can chain a bunch of processing steps that
>>>>> requires some async action, return a future and continue so on and so forth.
>>>>> Also, it makes handling complex error cases easy by giving you a way
>>>>> to skip lots of processing steps strait to onError at the end of the
>>>>> chain.
>>>>>
>>>>> Take a look at folly. Take a look at the expanded boost futures
>>>>> (they call this future continuations:
>>>>> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization
>>>>> .html#thread.synchronization.futures.then
>>>>> ). Also, building a cut down future framework just for Ceph (or
>>>>> reduced set folly one) might be another option.
>>>>>
>>>>> Just an alternative.
>>>>>
>>>>> >
>>>>> > Anyway, I think we need to do some changes for this field.
>>>>> >
>>>>> >> Yehuda
>>>>> >>
>>>>> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com> wrote:
>>>>> >>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all
>>>>> >>> tied to the actual interface I presented so much as the notion
>>>>> >>> that the next thing to do is restructure the OpWQ users as async
>>>>> >>> state machines.
>>>>> >>> -Sam
>>>>> >>>
>>>>> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net> wrote:
>>>>> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>>>>> >>>>> Currently, there are some deficiencies in how the OSD maps ops
>>>>> >>>>> onto
>>>>> >>>>> threads:
>>>>> >>>>>
>>>>> >>>>> 1. Reads are always syncronous limiting the queue depth seen
>>>>> >>>>> from the device
>>>>> >>>>>    and therefore the possible parallelism.
>>>>> >>>>> 2. Writes are always asyncronous forcing even very fast writes
>>>>> >>>>> to be completed
>>>>> >>>>>    in a seperate thread.
>>>>> >>>>> 3. do_op cannot surrender the thread/pg lock during an
>>>>> >>>>> operation forcing reads
>>>>> >>>>>    required to continue the operation to be syncronous.
>>>>> >>>>>
>>>>> >>>>> For spinning disks, this is mostly ok since they don't benefit
>>>>> >>>>> as much from large read queues, and writes (filestore with
>>>>> >>>>> journal) are too slow for the thread switches to make a big
>>>>> >>>>> difference.  For very fast flash, however, we want the
>>>>> >>>>> flexibility to allow the backend to perform writes
>>>>> >>>>> syncronously or asyncronously when it makes sense, and to
>>>>> >>>>> maintain a larger number of outstanding reads than we have
>>>>> >>>>> threads.  To that end, I suggest changing the ObjectStore
>>>>> >>>>> interface to be somewhat polling based:
>>>>> >>>>>
>>>>> >>>>> /// Create new token
>>>>> >>>>> void *create_operation_token() = 0; bool
>>>>> >>>>> is_operation_complete(void *token) = 0; bool
>>>>> >>>>> is_operation_committed(void *token) = 0; bool
>>>>> >>>>> is_operation_applied(void *token) = 0; void
>>>>> >>>>> wait_for_committed(void *token) = 0; void
>>>>> >>>>> wait_for_applied(void *token) = 0; void wait_for_complete(void
>>>>> >>>>> *token) = 0; /// Get result of operation int get_result(void
>>>>> >>>>> *token) = 0; /// Must only be called once
>>>>> >>>>> is_opearation_complete(token) void reset_operation_token(void
>>>>> >>>>> *token) = 0; /// Must only be called once
>>>>> >>>>> is_opearation_complete(token) void detroy_operation_token(void
>>>>> >>>>> *token) = 0;
>>>>> >>>>>
>>>>> >>>>> /**
>>>>> >>>>>  * Queue a transaction
>>>>> >>>>>  *
>>>>> >>>>>  * token must be either fresh or reset since the last operation.
>>>>> >>>>>  * If the operation is completed syncronously, token can be
>>>>> >>>>> resused
>>>>> >>>>>  * without calling reset_operation_token.
>>>>> >>>>>  *
>>>>> >>>>>  * @result 0 if completed syncronously, -EAGAIN if async  */
>>>>> >>>>> int queue_transaction(
>>>>> >>>>>   Transaction *t,
>>>>> >>>>>   OpSequencer *osr,
>>>>> >>>>>   void *token
>>>>> >>>>>   ) = 0;
>>>>> >>>>>
>>>>> >>>>> /**
>>>>> >>>>>  * Queue a transaction
>>>>> >>>>>  *
>>>>> >>>>>  * token must be either fresh or reset since the last operation.
>>>>> >>>>>  * If the operation is completed syncronously, token can be
>>>>> >>>>> resused
>>>>> >>>>>  * without calling reset_operation_token.
>>>>> >>>>>  *
>>>>> >>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>>>> >>>>>  */
>>>>> >>>>> int read(..., void *token) = 0; ...
>>>>> >>>>>
>>>>> >>>>> The "token" concept here is opaque to allow the implementation
>>>>> >>>>> some flexibility.  Ideally, it would be nice to be able to
>>>>> >>>>> include libaio operation contexts directly.
>>>>> >>>>>
>>>>> >>>>> The main goal here is for the backend to have the freedom to
>>>>> >>>>> complete writes and reads asyncronously or syncronously as the
>>>>> >>>>> sitation warrants.
>>>>> >>>>> It also leaves the interface user in control of where the
>>>>> >>>>> operation completion is handled.  Each op thread can therefore
>>>>> >>>>> handle its own
>>>>> >>>>> completions:
>>>>> >>>>>
>>>>> >>>>> struct InProgressOp {
>>>>> >>>>>   PGRef pg;
>>>>> >>>>>   ObjectStore::Token *token;
>>>>> >>>>>   OpContext *ctx;
>>>>> >>>>> };
>>>>> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>>>>> >>>>
>>>>> >>>> Probably a deque<> since we'll be pushign new requests and
>>>>> >>>> slurping off completed ones?  Or, we can make token not
>>>>> >>>> completely opaque, so that it includes a boost::intrusive::list
>>>>> >>>> node and can be strung on a user-managed queue.
>>>>> >>>>
>>>>> >>>>> for (auto op : in_progress) {
>>>>> >>>>>   op.token = objectstore->create_operation_token();
>>>>> >>>>> }
>>>>> >>>>>
>>>>> >>>>> uint64_t next_to_start = 0;
>>>>> >>>>> uint64_t next_to_complete = 0;
>>>>> >>>>>
>>>>> >>>>> while (1) {
>>>>> >>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>>>>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>>> >>>>>     objectstore->wait_for_complete(op.token);
>>>>> >>>>>   }
>>>>> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>>>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>>> >>>>>     if (objectstore->is_operation_complete(op.token)) {
>>>>> >>>>>       PGRef pg = op.pg;
>>>>> >>>>>       OpContext *ctx = op.ctx;
>>>>> >>>>>       op.pg = PGRef();
>>>>> >>>>>       op.ctx = nullptr;
>>>>> >>>>>       objectstore->reset_operation_token(op.token);
>>>>> >>>>>       if (pg->continue_op(
>>>>> >>>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>>> >>>>>               == -EAGAIN) {
>>>>> >>>>>         ++next_to_start;
>>>>> >>>>>         continue;
>>>>> >>>>>       }
>>>>> >>>>>     } else {
>>>>> >>>>>       break;
>>>>> >>>>>     }
>>>>> >>>>>   }
>>>>> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>>>> >>>>>   if (dq.second->do_op(
>>>>> >>>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>>>> >>>>>           == -EAGAIN) {
>>>>> >>>>>     ++next_to_start;
>>>>> >>>>>   }
>>>>> >>>>> }
>>>>> >>>>>
>>>>> >>>>> A design like this would allow the op thread to move onto
>>>>> >>>>> another task if the objectstore implementation wants to
>>>>> >>>>> perform an async operation.  For this to work, there is some
>>>>> >>>>> work to be done:
>>>>> >>>>>
>>>>> >>>>> 1. All current reads in the read and write paths (probably
>>>>> >>>>> including the attr
>>>>> >>>>>    reads in get_object_context and friends) need to be able to handle
>>>>> >>>>>    getting
>>>>> >>>>>    -EAGAIN from the objectstore.
>>>>> >>>>
>>>>> >>>> Can we leave the old read methods in place as blocking
>>>>> >>>> versions, and have them block on the token before returning?
>>>>> >>>> That'll make the transition less painful.
>>>>> >>>>
>>>>> >>>>> 2. Writes and reads need to be able to handle having the pg
>>>>> >>>>> lock dropped
>>>>> >>>>>    during the operation.  This should be ok since the actual object
>>>>> >>>>>    information
>>>>> >>>>>    is protected by the RWState locks.
>>>>> >>>>
>>>>> >>>> All of the async write pieces already handle this (recheck PG
>>>>> >>>> state after taking the lock).  If they don't get -EAGAIN they'd
>>>>> >>>> just call the next stage, probably with a flag indicating that
>>>>> >>>> validation can be skipped (since the lock hasn't been dropped)?
>>>>> >>>>
>>>>> >>>>> 3. OpContext needs to have enough information to pick up where
>>>>> >>>>> the operation
>>>>> >>>>>    left off.  This suggests that we should obtain all required
>>>>> >>>>>    ObjectContexts
>>>>> >>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
>>>>> >>>>
>>>>> >>>> Yeah...
>>>>> >>>>
>>>>> >>>>> 4. The object class interface will need to be replaced with a
>>>>> >>>>> new interface
>>>>> >>>>>    based on possibly async reads.  We can maintain compatibility with
>>>>> >>>>>    the
>>>>> >>>>>    current ones by launching a new thread to handle any message which
>>>>> >>>>>    happens
>>>>> >>>>>    to contain an old-style object class operation.
>>>>> >>>>
>>>>> >>>> Again, for now, wrappers would avoid this?
>>>>> >>>>
>>>>> >>>> s
>>>>> >>>>>
>>>>> >>>>> Most of this needs to happen to support object class
>>>>> >>>>> operations on ec pools anyway.
>>>>> >>>>> -Sam
>>>>> >>>>> --
>>>>> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> >>>>> in
>>>>> >>>>> the body of a message to majordomo@vger.kernel.org More
>>>>> >>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>> --
>>>>> >>> To unsubscribe from this list: send the line "unsubscribe
>>>>> >>> ceph-devel" in the body of a message to
>>>>> >>> majordomo@vger.kernel.org More majordomo info at
>>>>> >>> http://vger.kernel.org/majordomo-info.html
>>>>> >> --
>>>>> >> To unsubscribe from this list: send the line "unsubscribe
>>>>> >> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>>> >> More majordomo info at
>>>>> >> http://vger.kernel.org/majordomo-info.html
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Best Regards,
>>>>> >
>>>>> > Wheat
>>>>> > --
>>>>> > To unsubscribe from this list: send the line "unsubscribe
>>>>> > ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Milosz Tanski
>>>>> CTO
>>>>> 16 East 34th Street, 15th floor
>>>>> New York, NY 10016
>>>>>
>>>>> p: 646-253-9055
>>>>> e: milosz@adfin.com
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>
>>>
>>>
>>> --
>>> Milosz Tanski
>>> CTO
>>> 16 East 34th Street, 15th floor
>>> New York, NY 10016
>>>
>>> p: 646-253-9055
>>> e: milosz@adfin.com
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
  2015-10-28 20:15                     ` Samuel Just
@ 2015-10-30 16:20                       ` Jason Dillaman
       [not found]                         ` <1491577902.41976602.1446223269619.JavaMail.zimbra@redhat.com>
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Dillaman @ 2015-10-30 16:20 UTC (permalink / raw)
  To: Samuel Just
  Cc: ceph-devel, Milosz Tanski, Matt Benjamin, Haomai Wang,
	Yehuda Sadeh-Weinraub, Sage Weil, Adam Emerson, Stephen L Blinick

In terms of overhead for the interface, is your concern due to the potential for more heap allocations for storing tasks to complete the future?  If that's the case, Adam's std::function replacement [1], which allocates a configurable amount of storage on the stack (as opposed to going to the heap for functions larger than 16 bytes as in the GCC implementation [2]), might be a good compromise.  

I had some time to deep-dive through the seastar library this week and was really impressed by its design.  I know we (semi-)recently needed to introduce a workaround change to librbd which now results in a context switch for each AIO op -- all to ensure that we never block the QEMU IO thread.  Long term, I would love if we could take some of these reactive programming concepts to avoid all potential blocking on the librbd / librados IO path (with a side benefit of perhaps eliminating the potential for thousands of network IO threads by instead sharding the work).

[1] http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/27100
[2] http://www.drdobbs.com/cpp/efficient-use-of-lambda-expressions-and/232500059?pgno=2

-- 

Jason Dillaman 


----- Original Message -----
> From: "Samuel Just" <sjust@redhat.com>
> To: "ceph-devel" <ceph-devel@vger.kernel.org>
> Cc: "Milosz Tanski" <milosz@adfin.com>, "Matt Benjamin" <mbenjamin@redhat.com>, "Haomai Wang" <haomaiwang@gmail.com>,
> "Yehuda Sadeh-Weinraub" <ysadehwe@redhat.com>, "Sage Weil" <sage@newdream.net>, "Adam Emerson"
> <aemerson@redhat.com>, "Stephen L Blinick" <stephen.l.blinick@intel.com>
> Sent: Wednesday, October 28, 2015 4:15:11 PM
> Subject: Re: Async reads, sync writes, op thread model discussion
> 
> Sorry for the long delay since my last update, been working mostly on
> infernalis stabilization :)
> 
> https://github.com/athanatos/ceph/tree/wip-do-op-2015-10-28
> 
> has my current branch.
> 
> I've changed direction a bit, a future based interface seems like it
> will provide the most tolerable interface for sequencing operations on
> possibly async objectstore results.  I've got an early version of a
> common/Future.h fleshed out (very different goals from the C++ futures
> library, I'm afraid -- based heavily on the seastar library) and a
> commit which converts get_object_context to use the interface.
> 
> One of the main questions for me is whether we can get the overhead of
> the Future interface and operation sequencing low enough to be
> tolerable for fast synchronous backends.  I'm still working on trying
> to get at least a simple read path converted to use async objectstore
> operations so we can test the overhead and the performance improvement
> for an async backend.
> -Sam
> 
> On Fri, Aug 28, 2015 at 2:25 PM, Samuel Just <sjust@redhat.com> wrote:
> > Oh, yeah, we'll definitely test for correctness for async reads on
> > filestore, I'm just worried about validating the performance
> > assumptions.  The 3700s might be just fine for that validation though.
> > -Sam
> >
> > On Fri, Aug 28, 2015 at 1:01 PM, Blinick, Stephen L
> > <stephen.l.blinick@intel.com> wrote:
> >> This sounds ok, with the synchronous interface still possible to the
> >> ObjectStore based on return code.
> >>
> >> I'd think that the async read interface can be evaluated with any
> >> hardware, at least for correctness, by observing the queue depth to the
> >> device during a test run.  Also, I think asynchronous reads may benefit
> >> various types of NAND SSD's as they do better with more parallelism and I
> >> typically see very low queuedepth to them today with Filestore (one of
> >> the reasons I think doubling up OSD's on a single flash device helps
> >> benchmarks).
> >>
> >> Thanks,
> >>
> >> Stephen
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org
> >> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Samuel Just
> >> Sent: Thursday, August 27, 2015 4:22 PM
> >> To: Milosz Tanski
> >> Cc: Matt Benjamin; Haomai Wang; Yehuda Sadeh-Weinraub; Sage Weil;
> >> ceph-devel
> >> Subject: Re: Async reads, sync writes, op thread model discussion
> >>
> >> It's been a couple of weeks, so I thought I'd send out a short progress
> >> update.  I've started by trying to nail down enough of the threading
> >> design/async interface to start refactoring do_op.  For the moment, I've
> >> backtracked on the token approach mostly because it seemed more
> >> complicated than necessary.  I'm thinking we'll keep a callback like
> >> mechanism, but move responsibility for queuing and execution back to the
> >> interface user by allowing the user to pass a completion queue and an
> >> uninterpreted completion pointer.  These two commits have the gist of the
> >> direction I'm going in (the actual code is more a place holder today).
> >> An OSDReactor instance will replace each of the "shards" in the current
> >> sharded work queue.  Any aio initiated by a pg operation from a reactor
> >> will pass that reactor's queue, ensuring that the completion winds up
> >> back in the same thread.
> >> Writes would work pretty much the same way, but with two callbacks.
> >>
> >> My plan is to flesh this out to the point where the OSD works again, and
> >> then refactor the osd write path to use this mechanism for basic rbd
> >> writes.  That should be enough to let us evaluate whether this is a good
> >> path forward for async writes.  Async reads may be a bit tricky to
> >> evaluate.  It seems like we'd need hardware that needs that kind of queue
> >> depth and an objectstore implementation which can exploit it.
> >> I'll wire up filestore to do async reads optionally for testing purposes,
> >> but it's not clear to me that there will be cases where filestore would
> >> want to do an async read rather than a sync read.
> >>
> >> https://github.com/athanatos/ceph/commit/642b7190d70a5970534b911f929e6e3885bf99c4
> >> https://github.com/athanatos/ceph/commit/42bee815081a91abd003bf7170ef1270f23222f6
> >> -Sam
> >>
> >> On Fri, Aug 14, 2015 at 3:36 PM, Milosz Tanski <milosz@adfin.com> wrote:
> >>> On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin <mbenjamin@redhat.com>
> >>> wrote:
> >>>> Hi,
> >>>>
> >>>> I tend to agree with your comments regarding swapcontext/fibers.  I am
> >>>> not much more enamored of jumping to new models (new! frameworks!) as a
> >>>> single jump, either.
> >>>
> >>> Not suggesting the libraries/frameworks. Just brining up promises as
> >>> an alternative technique to coroutines. Dealing with spaghetti
> >>> evented/callback code gets old after doing it for 10+ years. Then
> >>> throw in blocking IO.
> >>>
> >>> And FYI, the data flow promises go back in comp sci back to the 80s.
> >>>
> >>> Cheers,
> >>> - Milosz
> >>>
> >>>>
> >>>> I like the way I interpreted Sam's design to be going, and in
> >>>> particular, that it seems to allow for consistent handling of read,
> >>>> write transactions.  I also would like to see how Yehuda's system works
> >>>> before arguing generalities.
> >>>>
> >>>> My intuition is, since the goal is more deterministic performance in
> >>>> a short horizion, you
> >>>>
> >>>> a. need to prioritize transparency over novel abstractions b. need to
> >>>> build solid microbenchmarks that encapsulate small, then larger
> >>>> pieces of the work pipeline
> >>>>
> >>>> My .05.
> >>>>
> >>>> Matt
> >>>>
> >>>> --
> >>>> Matt Benjamin
> >>>> Red Hat, Inc.
> >>>> 315 West Huron Street, Suite 140A
> >>>> Ann Arbor, Michigan 48103
> >>>>
> >>>> http://www.redhat.com/en/technologies/storage
> >>>>
> >>>> tel.  734-761-4689
> >>>> fax.  734-769-8938
> >>>> cel.  734-216-5309
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: "Milosz Tanski" <milosz@adfin.com>
> >>>>> To: "Haomai Wang" <haomaiwang@gmail.com>
> >>>>> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@redhat.com>, "Samuel Just"
> >>>>> <sjust@redhat.com>, "Sage Weil" <sage@newdream.net>,
> >>>>> ceph-devel@vger.kernel.org
> >>>>> Sent: Friday, August 14, 2015 4:56:26 PM
> >>>>> Subject: Re: Async reads, sync writes, op thread model discussion
> >>>>>
> >>>>> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@gmail.com>
> >>>>> wrote:
> >>>>> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
> >>>>> > <ysadehwe@redhat.com> wrote:
> >>>>> >> Already mentioned it on irc, adding to ceph-devel for the sake of
> >>>>> >> completeness. I did some infrastructure work for rgw and it seems
> >>>>> >> (at least to me) that it could at least be partially useful here.
> >>>>> >> Basically it's an async execution framework that utilizes
> >>>>> >> coroutines.
> >>>>> >> It's comprised of aio notification manager that can also be tied
> >>>>> >> into coroutines execution. The coroutines themselves are
> >>>>> >> stackless, they are implemented as state machines, but using some
> >>>>> >> boost trickery to hide the details so they can be written very
> >>>>> >> similar to blocking methods. Coroutines can also execute other
> >>>>> >> coroutines and can be stacked, or can generate concurrent
> >>>>> >> execution. It's still somewhat in flux, but I think it's mostly
> >>>>> >> done and already useful at this point, so if there's anything you
> >>>>> >> could use it might be a good idea to avoid effort duplication.
> >>>>> >>
> >>>>> >
> >>>>> > coroutines like qemu is cool. The only thing I afraid is the
> >>>>> > complicate of debug and it's really a big task :-(
> >>>>> >
> >>>>> > I agree with sage that this design is really a new implementation
> >>>>> > for objectstore so that it's harmful to existing objectstore impl.
> >>>>> > I also suffer the pain from sync read xattr, we may add a async
> >>>>> > read interface to solove this?
> >>>>> >
> >>>>> > For context switch thing, now we have at least 3 cs for one op at
> >>>>> > osd side. messenger -> op queue -> objectstore queue. I guess op
> >>>>> > queue -> objectstore is easier to kick off just as sam said. We
> >>>>> > can make write journal inline with queue_transaction, so the
> >>>>> > caller could directly handle the transaction right now.
> >>>>>
> >>>>> I would caution agains coroutines (fibers) esp. in a multi-threaded
> >>>>> environment. Posix has officially obsoleted the swapcontext family
> >>>>> of functions in 1003.1-2004 and removed it in 1003.1-2008. That's
> >>>>> because they were notoriously non portable, and buggy. And yes you
> >>>>> can use something like boost::context / boost::coroutine instead but
> >>>>> they also have platform limitations. These implementations tend to
> >>>>> abuse / turn of various platform scrutiny features (like the one for
> >>>>> setjmp/longjmp). And on top of that many platforms don't consider
> >>>>> alternative context so you end up with obscure bugs. I've debugged
> >>>>> my fair share of bugs in Mordor coroutines with C++ exceptions, and
> >>>>> errno variables (since errno is really a function on linux and it's
> >>>>> output a pointer to threads errno is marked pure) if your coroutine
> >>>>> migrates threads. And you need to migrate them because of blocking
> >>>>> and uneven processor/thread distribution.
> >>>>>
> >>>>> None of these are obstacles that can't be solved, but added together
> >>>>> they become a pretty long term liability. So I think long and hard
> >>>>> about it. Qemu doesn't have some of those issues because it's uses a
> >>>>> single thread and a much simpler C ABI that it deals with.
> >>>>>
> >>>>> An alternative to coroutines that goes a long way towards solving
> >>>>> the callback spaghetti problem is futures/promises. I'm not talking
> >>>>> of the very future model that exists in C++11 library but more along
> >>>>> the lines that exist in other languages (like what's being done in
> >>>>> Javascript today). There's a good implementation of it Folly (the
> >>>>> facebook c++11 library). They have a very nice piece of
> >>>>> documentation here to understand how they work and how they differ.
> >>>>>
> >>>>> That future model is very handy when dealing with the callback
> >>>>> control flow problem. You can chain a bunch of processing steps that
> >>>>> requires some async action, return a future and continue so on and so
> >>>>> forth.
> >>>>> Also, it makes handling complex error cases easy by giving you a way
> >>>>> to skip lots of processing steps strait to onError at the end of the
> >>>>> chain.
> >>>>>
> >>>>> Take a look at folly. Take a look at the expanded boost futures
> >>>>> (they call this future continuations:
> >>>>> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization
> >>>>> .html#thread.synchronization.futures.then
> >>>>> ). Also, building a cut down future framework just for Ceph (or
> >>>>> reduced set folly one) might be another option.
> >>>>>
> >>>>> Just an alternative.
> >>>>>
> >>>>> >
> >>>>> > Anyway, I think we need to do some changes for this field.
> >>>>> >
> >>>>> >> Yehuda
> >>>>> >>
> >>>>> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@redhat.com>
> >>>>> >> wrote:
> >>>>> >>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all
> >>>>> >>> tied to the actual interface I presented so much as the notion
> >>>>> >>> that the next thing to do is restructure the OpWQ users as async
> >>>>> >>> state machines.
> >>>>> >>> -Sam
> >>>>> >>>
> >>>>> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@newdream.net>
> >>>>> >>> wrote:
> >>>>> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
> >>>>> >>>>> Currently, there are some deficiencies in how the OSD maps ops
> >>>>> >>>>> onto
> >>>>> >>>>> threads:
> >>>>> >>>>>
> >>>>> >>>>> 1. Reads are always syncronous limiting the queue depth seen
> >>>>> >>>>> from the device
> >>>>> >>>>>    and therefore the possible parallelism.
> >>>>> >>>>> 2. Writes are always asyncronous forcing even very fast writes
> >>>>> >>>>> to be completed
> >>>>> >>>>>    in a seperate thread.
> >>>>> >>>>> 3. do_op cannot surrender the thread/pg lock during an
> >>>>> >>>>> operation forcing reads
> >>>>> >>>>>    required to continue the operation to be syncronous.
> >>>>> >>>>>
> >>>>> >>>>> For spinning disks, this is mostly ok since they don't benefit
> >>>>> >>>>> as much from large read queues, and writes (filestore with
> >>>>> >>>>> journal) are too slow for the thread switches to make a big
> >>>>> >>>>> difference.  For very fast flash, however, we want the
> >>>>> >>>>> flexibility to allow the backend to perform writes
> >>>>> >>>>> syncronously or asyncronously when it makes sense, and to
> >>>>> >>>>> maintain a larger number of outstanding reads than we have
> >>>>> >>>>> threads.  To that end, I suggest changing the ObjectStore
> >>>>> >>>>> interface to be somewhat polling based:
> >>>>> >>>>>
> >>>>> >>>>> /// Create new token
> >>>>> >>>>> void *create_operation_token() = 0; bool
> >>>>> >>>>> is_operation_complete(void *token) = 0; bool
> >>>>> >>>>> is_operation_committed(void *token) = 0; bool
> >>>>> >>>>> is_operation_applied(void *token) = 0; void
> >>>>> >>>>> wait_for_committed(void *token) = 0; void
> >>>>> >>>>> wait_for_applied(void *token) = 0; void wait_for_complete(void
> >>>>> >>>>> *token) = 0; /// Get result of operation int get_result(void
> >>>>> >>>>> *token) = 0; /// Must only be called once
> >>>>> >>>>> is_opearation_complete(token) void reset_operation_token(void
> >>>>> >>>>> *token) = 0; /// Must only be called once
> >>>>> >>>>> is_opearation_complete(token) void detroy_operation_token(void
> >>>>> >>>>> *token) = 0;
> >>>>> >>>>>
> >>>>> >>>>> /**
> >>>>> >>>>>  * Queue a transaction
> >>>>> >>>>>  *
> >>>>> >>>>>  * token must be either fresh or reset since the last operation.
> >>>>> >>>>>  * If the operation is completed syncronously, token can be
> >>>>> >>>>> resused
> >>>>> >>>>>  * without calling reset_operation_token.
> >>>>> >>>>>  *
> >>>>> >>>>>  * @result 0 if completed syncronously, -EAGAIN if async  */
> >>>>> >>>>> int queue_transaction(
> >>>>> >>>>>   Transaction *t,
> >>>>> >>>>>   OpSequencer *osr,
> >>>>> >>>>>   void *token
> >>>>> >>>>>   ) = 0;
> >>>>> >>>>>
> >>>>> >>>>> /**
> >>>>> >>>>>  * Queue a transaction
> >>>>> >>>>>  *
> >>>>> >>>>>  * token must be either fresh or reset since the last operation.
> >>>>> >>>>>  * If the operation is completed syncronously, token can be
> >>>>> >>>>> resused
> >>>>> >>>>>  * without calling reset_operation_token.
> >>>>> >>>>>  *
> >>>>> >>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
> >>>>> >>>>>  */
> >>>>> >>>>> int read(..., void *token) = 0; ...
> >>>>> >>>>>
> >>>>> >>>>> The "token" concept here is opaque to allow the implementation
> >>>>> >>>>> some flexibility.  Ideally, it would be nice to be able to
> >>>>> >>>>> include libaio operation contexts directly.
> >>>>> >>>>>
> >>>>> >>>>> The main goal here is for the backend to have the freedom to
> >>>>> >>>>> complete writes and reads asyncronously or syncronously as the
> >>>>> >>>>> sitation warrants.
> >>>>> >>>>> It also leaves the interface user in control of where the
> >>>>> >>>>> operation completion is handled.  Each op thread can therefore
> >>>>> >>>>> handle its own
> >>>>> >>>>> completions:
> >>>>> >>>>>
> >>>>> >>>>> struct InProgressOp {
> >>>>> >>>>>   PGRef pg;
> >>>>> >>>>>   ObjectStore::Token *token;
> >>>>> >>>>>   OpContext *ctx;
> >>>>> >>>>> };
> >>>>> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
> >>>>> >>>>
> >>>>> >>>> Probably a deque<> since we'll be pushign new requests and
> >>>>> >>>> slurping off completed ones?  Or, we can make token not
> >>>>> >>>> completely opaque, so that it includes a boost::intrusive::list
> >>>>> >>>> node and can be strung on a user-managed queue.
> >>>>> >>>>
> >>>>> >>>>> for (auto op : in_progress) {
> >>>>> >>>>>   op.token = objectstore->create_operation_token();
> >>>>> >>>>> }
> >>>>> >>>>>
> >>>>> >>>>> uint64_t next_to_start = 0;
> >>>>> >>>>> uint64_t next_to_complete = 0;
> >>>>> >>>>>
> >>>>> >>>>> while (1) {
> >>>>> >>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
> >>>>> >>>>>     InProgressOp &op = in_progress[next_to_complete %
> >>>>> >>>>>     MAX_IN_PROGRESS];
> >>>>> >>>>>     objectstore->wait_for_complete(op.token);
> >>>>> >>>>>   }
> >>>>> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
> >>>>> >>>>>     InProgressOp &op = in_progress[next_to_complete %
> >>>>> >>>>>     MAX_IN_PROGRESS];
> >>>>> >>>>>     if (objectstore->is_operation_complete(op.token)) {
> >>>>> >>>>>       PGRef pg = op.pg;
> >>>>> >>>>>       OpContext *ctx = op.ctx;
> >>>>> >>>>>       op.pg = PGRef();
> >>>>> >>>>>       op.ctx = nullptr;
> >>>>> >>>>>       objectstore->reset_operation_token(op.token);
> >>>>> >>>>>       if (pg->continue_op(
> >>>>> >>>>>             ctx, &in_progress_ops[next_to_start %
> >>>>> >>>>>             MAX_IN_PROGRESS])
> >>>>> >>>>>               == -EAGAIN) {
> >>>>> >>>>>         ++next_to_start;
> >>>>> >>>>>         continue;
> >>>>> >>>>>       }
> >>>>> >>>>>     } else {
> >>>>> >>>>>       break;
> >>>>> >>>>>     }
> >>>>> >>>>>   }
> >>>>> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
> >>>>> >>>>>   if (dq.second->do_op(
> >>>>> >>>>>         dq.first, &in_progress_ops[next_to_start %
> >>>>> >>>>>         MAX_IN_PROGRESS])
> >>>>> >>>>>           == -EAGAIN) {
> >>>>> >>>>>     ++next_to_start;
> >>>>> >>>>>   }
> >>>>> >>>>> }
> >>>>> >>>>>
> >>>>> >>>>> A design like this would allow the op thread to move onto
> >>>>> >>>>> another task if the objectstore implementation wants to
> >>>>> >>>>> perform an async operation.  For this to work, there is some
> >>>>> >>>>> work to be done:
> >>>>> >>>>>
> >>>>> >>>>> 1. All current reads in the read and write paths (probably
> >>>>> >>>>> including the attr
> >>>>> >>>>>    reads in get_object_context and friends) need to be able to
> >>>>> >>>>>    handle
> >>>>> >>>>>    getting
> >>>>> >>>>>    -EAGAIN from the objectstore.
> >>>>> >>>>
> >>>>> >>>> Can we leave the old read methods in place as blocking
> >>>>> >>>> versions, and have them block on the token before returning?
> >>>>> >>>> That'll make the transition less painful.
> >>>>> >>>>
> >>>>> >>>>> 2. Writes and reads need to be able to handle having the pg
> >>>>> >>>>> lock dropped
> >>>>> >>>>>    during the operation.  This should be ok since the actual
> >>>>> >>>>>    object
> >>>>> >>>>>    information
> >>>>> >>>>>    is protected by the RWState locks.
> >>>>> >>>>
> >>>>> >>>> All of the async write pieces already handle this (recheck PG
> >>>>> >>>> state after taking the lock).  If they don't get -EAGAIN they'd
> >>>>> >>>> just call the next stage, probably with a flag indicating that
> >>>>> >>>> validation can be skipped (since the lock hasn't been dropped)?
> >>>>> >>>>
> >>>>> >>>>> 3. OpContext needs to have enough information to pick up where
> >>>>> >>>>> the operation
> >>>>> >>>>>    left off.  This suggests that we should obtain all required
> >>>>> >>>>>    ObjectContexts
> >>>>> >>>>>    at the beginning of the operation.  Cache/Tiering complicates
> >>>>> >>>>>    this.
> >>>>> >>>>
> >>>>> >>>> Yeah...
> >>>>> >>>>
> >>>>> >>>>> 4. The object class interface will need to be replaced with a
> >>>>> >>>>> new interface
> >>>>> >>>>>    based on possibly async reads.  We can maintain compatibility
> >>>>> >>>>>    with
> >>>>> >>>>>    the
> >>>>> >>>>>    current ones by launching a new thread to handle any message
> >>>>> >>>>>    which
> >>>>> >>>>>    happens
> >>>>> >>>>>    to contain an old-style object class operation.
> >>>>> >>>>
> >>>>> >>>> Again, for now, wrappers would avoid this?
> >>>>> >>>>
> >>>>> >>>> s
> >>>>> >>>>>
> >>>>> >>>>> Most of this needs to happen to support object class
> >>>>> >>>>> operations on ec pools anyway.
> >>>>> >>>>> -Sam
> >>>>> >>>>> --
> >>>>> >>>>> To unsubscribe from this list: send the line "unsubscribe
> >>>>> >>>>> ceph-devel"
> >>>>> >>>>> in
> >>>>> >>>>> the body of a message to majordomo@vger.kernel.org More
> >>>>> >>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>> >>>>>
> >>>>> >>>>>
> >>>>> >>> --
> >>>>> >>> To unsubscribe from this list: send the line "unsubscribe
> >>>>> >>> ceph-devel" in the body of a message to
> >>>>> >>> majordomo@vger.kernel.org More majordomo info at
> >>>>> >>> http://vger.kernel.org/majordomo-info.html
> >>>>> >> --
> >>>>> >> To unsubscribe from this list: send the line "unsubscribe
> >>>>> >> ceph-devel" in the body of a message to majordomo@vger.kernel.org
> >>>>> >> More majordomo info at
> >>>>> >> http://vger.kernel.org/majordomo-info.html
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > --
> >>>>> > Best Regards,
> >>>>> >
> >>>>> > Wheat
> >>>>> > --
> >>>>> > To unsubscribe from this list: send the line "unsubscribe
> >>>>> > ceph-devel" in the body of a message to majordomo@vger.kernel.org
> >>>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Milosz Tanski
> >>>>> CTO
> >>>>> 16 East 34th Street, 15th floor
> >>>>> New York, NY 10016
> >>>>>
> >>>>> p: 646-253-9055
> >>>>> e: milosz@adfin.com
> >>>>> --
> >>>>> To unsubscribe from this list: send the line "unsubscribe
> >>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
> >>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Milosz Tanski
> >>> CTO
> >>> 16 East 34th Street, 15th floor
> >>> New York, NY 10016
> >>>
> >>> p: 646-253-9055
> >>> e: milosz@adfin.com
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org More majordomo info at
> >> http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Async reads, sync writes, op thread model discussion
       [not found]                         ` <1491577902.41976602.1446223269619.JavaMail.zimbra@redhat.com>
@ 2015-10-30 16:55                           ` Adam C. Emerson
  0 siblings, 0 replies; 17+ messages in thread
From: Adam C. Emerson @ 2015-10-30 16:55 UTC (permalink / raw)
  To: Jason Dillaman
  Cc: Samuel Just, ceph-devel, Milosz Tanski, Matt Benjamin,
	Haomai Wang, Yehuda Sadeh-Weinraub, Sage Weil, Adam Emerson,
	Stephen L Blinick

On 30/10/2015, Jason Dillaman wrote:
> In terms of overhead for the interface, is your concern due to the
> potential for more heap allocations for storing tasks to complete
> the future?  If that's the case, Adam's std::function replacement [1],
> which allocates a configurable amount of storage on the stack (as opposed
> to going to the heap for functions larger than 16 bytes as in the GCC
> implementation [2]), might be a good compromise.
> 
> I had some time to deep-dive through the seastar library this week and
> was really impressed by its design.  I know we (semi-)recently needed to
> introduce a workaround change to librbd which now results in a context
> switch for each AIO op -- all to ensure that we never block the QEMU IO
> thread.  Long term, I would love if we could take some of these reactive
> programming concepts to avoid all potential blocking on the librbd /
> librados IO path (with a side benefit of perhaps eliminating the potential
> for thousands of network IO threads by instead sharding the work).
> 
> [1] http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/27100
> [2] http://www.drdobbs.com/cpp/efficient-use-of-lambda-expressions-and/232500059?pgno=2

We have been working on a short run prototype/research project to see
about using Seastar itself as a reimplementation of the OSD top-half that
can support the semantics that Ceph's placement group system currently
relies upon as well as different guarantees alongside them. This is
part of our effort to repackage and upstream the Cohort low-latency
OSD work in a way that will integrate better with the existing Ceph
requirements. (Eric is also hoping to use this as a model for
implementing and evaluating DM-CLOCK.)

We did find that replacing Context* with the static-reservation function
implementation greatly improved the performance of Objecter (which
is what I initially designed it for), but there was still a lot of
lock contention. While Seastar itself as it exists is not suitable for
client applications (since it wants to take over your whole program and
assumes it owns cores in their entirety) a refactoring or inspiration
from Seastar that included futures, the CPU-affine allocator, their
mostly lock-free approach, and their userspace network support would
likely give even better gains in Objecter performance. We are not (yet)
working on Objecter, but we would love to collaborate with anyone wanting
to do this work.

-- 
Adam C. Emerson, Senior Software Engineer
Red Hat Storage, Ann Arbor, Michigan, United States

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-10-30 16:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-08-11 16:40 Async reads, sync writes, op thread model discussion Samuel Just
2015-08-11 20:05 ` Sage Weil
2015-08-11 22:19   ` Samuel Just
2015-08-11 22:34     ` Yehuda Sadeh-Weinraub
2015-08-12  2:50       ` Haomai Wang
2015-08-12  6:55         ` Somnath Roy
2015-08-12  7:04           ` Haomai Wang
2015-08-14 20:56         ` Milosz Tanski
2015-08-14 21:19           ` Matt Benjamin
2015-08-14 21:39             ` Blinick, Stephen L
2015-08-14 22:36             ` Milosz Tanski
2015-08-27 23:21               ` Samuel Just
2015-08-28 20:01                 ` Blinick, Stephen L
2015-08-28 21:25                   ` Samuel Just
2015-10-28 20:15                     ` Samuel Just
2015-10-30 16:20                       ` Jason Dillaman
     [not found]                         ` <1491577902.41976602.1446223269619.JavaMail.zimbra@redhat.com>
2015-10-30 16:55                           ` Adam C. Emerson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.