Ann Arbor Team's Flexible I/O Proposals (Ceph Next)

From: Adam C. Emerson @ 2016-04-15 21:05 UTC (permalink / raw)
  To: The Sacred Order of the Squid Cybernetic

Ceph Developers,

We've put together a few of the main ideas from our previous work in a
brief form that we hope people will be able to digest, consider, and
debate. We'd also like to discuss them with you at Ceph Next this
Tuesday.

Thank you.

---8<---

We have been looking at improvements to Ceph, particularly RADOS,
while focusing on flexibility (allowing users to do more things)
and performance. We have come up with a few proposals with these two
things in mind. Sessions and read-write transactions aim to allow
clients to batch up multiple operations in a way that is safe and
correct, while allowing clients to gain the advantages of atomic
read-write operations without having to lock. Sessions also provide
a foundation for flow-control which ultimately improves performance
by preventing an OSD from being ground into uselessness under a
storm of impossible requests. The CLS proposal is a logical follow-on
from the read-write proposal, as we attempt to address some problems
of correctness that exist now and consider how to integrate the
facility into an asynchronous world.

Flexible Placement, as you would expect from the name, is about
allowing users more control, as are Flexible Semantics. They both
have profound performance implications, as tuning placement to better
match a workload can increase throughput, and relaxed consistency can 
decrease latency. The proposed Interfaces are meant to support both as
well as work currently being done to allow an asynchronous OSD and to
hide details like locking and thread pools so that backends can be
written with different forms of concurrency and load-balancing
across processors.

Finally, Map Partitioning is not directly related to code paths within
the OSD itself, but does affect everything that can be done with Ceph.
People are beginning to run into limits on how large a Ceph cluster can
grow and how many ways they can be partitioned, and both these problems
fundamentally derive from the way the OSD map is handled by the monitors.

There are also some notes at the end. They are not critical, but if you
find yourself asking "What were they thinking?" the notes might help.

# Sessions and Read-Write #

From `ReplicatedPG.cc`.

```c++
// Write operations aren't allowed to return a data payload because
// we can't do so reliably. If the client has to resend the request
// and it has already been applied, we will return 0 with no
// payload.  Non-deterministic behavior is no good.  However, it is
// possible to construct an operation that does a read, does a guard
// check (e.g., CMPXATTR), and then a write.  Then we either succeed
// with the write, or return a CMPXATTR and the read value.
…
if (ctx->op_t->empty() || result < 0) {
  …
  if (ctx->pending_async_reads.empty()) {
    complete_read_ctx(result, ctx);
  } else {
    in_progress_async_reads.push_back(make_pair(op, ctx));
    ctx->start_async_reads(this);
  }
  return;
}
…
// issue replica writes
ceph_tid_t rep_tid = osd->get_tid();

RepGather *repop = new_repop(ctx, obc, rep_tid);

issue_repop(repop, ctx);
eval_repop(repop);
```

As you can see, if we have any writes (all mutations end up in the
`op_t` transaction), we just flat out don't do the requested read
operations. If we don't have any writes, we perform the read
operations and return.  This is justified in the comment above because
of the non-deterministic behavior of resent read-write operations.

This is not an unsolved problem and we can bootstrap a solution on our
existing `Session` infrastructure.

## An upgraded session ##

Behold, `OSDSession`:
```c++
struct Session : public RefCountedObject {
  EntityName entity_name;
  OSDCap caps;
  int64_t auid;
  ConnectionRef con;
  WatchConState wstate;
  …
};
```

This structure exists once for every connection to the OSD. Where it
is created depends on who is doing the creation. In the case of
clients (what we're interested in) it occurs in `ms_verify_authorizer`:
```c++
…
isvalid = authorize_handler->verify_authorizer(cct, monc->rotating_secrets,
                                               authorizer_data, authorizer_reply, name, global_id, caps_info, session_key, &auid);

if (isvalid) {
  Session *s = static_cast<Session *>(con->get_priv());
  if (!s) {
    s = new Session(cct);
    con->set_priv(s->get());
    s->con = con;
    dout(10) << " new session " << s << " con=" << s->con << " addr=" << s->con->get_peer_addr() << dendl;
  }

  s->entity_name = name;
  if (caps_info.allow_all)
    s->caps.set_allow_all();
  s->auid = auid;
  …
}
```

In order to solve this problem, we propose a new data structure,
modelled on NFSv4.1:
```c++
struct OpSlot {
  uint64_t seq;
  int r;
  MOSDOpReplyRef cached; // Nullable
  bool completed;
};
```

We do not want to give the OSD an unbounded obligation to hang on to
old message replies: that way lies madness. So, the additions to
`Session` we might make are:

```c++
struct Session : public RefCountedObject {
  …
  uint32_t maxslots; // The maximum number of operations this client
                     // may have in flight at once;
  std::vector<OpSlot> slots; // The vector of in-progress operations
  ceph::timespan slots_expire; // How long we wait to hear from a
                               // client before the OSD is free to
                               // drop session resources
  ceph::coarse_mono_time last_contact; // When (by our measure) we
                                       // last received an operation
                                       // from the client.
};
```

## Message Additions ##

The OSD needs to communicate this information to the client. The most
useful way to do this is with an addition to `MOSDOpReply`.

```c++
class MOSDOpReply : public Message {
  …
  uint32_t this_slot;
  uint64_t this_seq;
  uint32_t max_slot;
  ceph::timespan timeout;
  …
};
```

This overlaps with the function of the transaction ID, since the
slot/sequence/OSD triple uniquely identifies an operation. Unlike the
transaction ID, this provides consistent semantics and a measure of
flow control.

To match our reply, the `MOSDOp` would need to be amended.
```c++
class MOSDOp : public Message {
  …
  uint32_t this_slot;
  uint64_t this_seq;
  bool please_cache;
  …
};
```

## Operations ##

### Connecting ###

A client, upon connecting to an OSD for the first time, should send a
`this_slot` of 0 and a `this_seq` of 0. If it reconnects to an OSD it
should use the `this_slot` and `this_seq` values from before it lost
its connection. If an OSD has state for a client and receives a
`(slot,seq) = (0,0)` then it should feel free to free any saved state
and start anew.

### OSD Feedback ###

In every `MOSDOpReply` the OSD should set `this_slot` and `this_seq` to
the values from the `MOSDOp` to which it is replying.

More usefully, the OSD can inform the client how many operations it is
allowed to send concurrently with `max_slot`. The client must **not**
send a slot value higher than `max_slot`. (The OSD should error if it
does.)

The OSD may increase the number of operations allowed in-flight
if it has capacity by increasing `max_slot`. If it finds itself
lacking capacity, it may decrease `max_slot`. If it does, the client
should respect the new bound. (The OSD should feel free to free the
rescinded slots as soon as the client sends another `MOSDOp` with a
slot value equal to one on which the new `max_slot` has been sent.)

If the client sends a `this_seq` lower than the one held for a slot by
the OSD, the OSD should error. If it is more than one greater than the
current `this_seq`, the OSD should error.
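
The client side of these rules can be sketched as a small slot allocator. This is only an illustration under assumptions: `ClientSlots`, `Ticket` and `SlotState` are hypothetical names, the first operation on a slot is assumed to use sequence 1, and error handling is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical client-side bookkeeping; none of these names exist in Ceph.
struct Ticket { uint32_t slot; uint64_t seq; };
struct SlotState { uint64_t seq = 0; bool busy = false; };

class ClientSlots {
  std::vector<SlotState> slots;
  uint32_t max_slot = 0;  // highest slot the OSD currently allows
public:
  ClientSlots() : slots(1) {}

  // Pick the lowest free slot not above max_slot and bump its sequence.
  // Returns false when every permitted slot has an operation in flight.
  bool acquire(Ticket& out) {
    for (uint32_t i = 0; i <= max_slot; ++i) {
      if (!slots[i].busy) {
        slots[i].busy = true;
        out = Ticket{i, ++slots[i].seq};
        return true;
      }
    }
    return false;
  }

  // A reply frees its slot and carries the OSD's current bound,
  // which may have grown or shrunk.
  void on_reply(const Ticket& t, uint32_t new_max_slot) {
    assert(t.slot < slots.size());
    slots[t.slot].busy = false;
    max_slot = new_max_slot;
    if (slots.size() <= max_slot) slots.resize(max_slot + 1);
  }
};
```

A client that finds `acquire` returning false simply holds the operation until a reply arrives, which is where the flow control comes from.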

### Caching ###

The client is in an excellent position to know whether it **requires**
the output of a previous operation of mixed reads and writes on
resend, or whether it merely needs the status on resend. Thus, we let
the client set `please_cache` to request that the OSD store a
reference to the reply it sent in the appropriate `OpSlot`.

The OSD is in an excellent position to know how loaded it is. It can
calculate a bound on how large a given reply will be before executing
it. Thus, the OSD can send an error if the client has requested it
cache something larger than it feels comfortable caching.

Assuming no errors, the behavior, for any slot, is this: If the client
sends an `MOSDOp` with a `this_seq` one greater than the current value
of `OpSlot::seq`, that represents a new operation. Increment
`OpSlot::seq`, clear `OpSlot::completed` and begin the operation. When
the operation finishes, set `OpSlot::completed`. If `please_cache` has been
set, store the `MOSDOpReply` in `OpSlot::cached`. Otherwise simply store the
result code in `OpSlot::r`.

If the client sends an `MOSDOp` with a `this_seq` equal to
`OpSlot::seq` and `OpSlot::completed` is false, drop the request. (We
will reply when it completes.) If it has completed, send the stored
`OpSlot::cached` reply if there is one; otherwise send a reply
carrying just `OpSlot::r`.
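
The per-slot behavior above could be modelled roughly as follows. This is a sketch, not the proposed implementation: `Reply` stands in for `MOSDOpReplyRef`, and `SlotTable`, `Action` and `want_cache` are hypothetical. Treating a new sequence number that arrives while the previous operation is still running as an error is our assumption, and the `(0,0)` reset on connection is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Reply = std::string;  // stand-in for the real reply message

struct OpSlot {
  uint64_t seq = 0;
  int r = 0;
  std::shared_ptr<const Reply> cached;  // null unless please_cache was set
  bool completed = true;                // nothing in flight initially
  bool want_cache = false;              // hypothetical: remembers please_cache
};

enum class Action { Execute, Drop, ResendCached, ResendStatus, Error };

class SlotTable {
  std::vector<OpSlot> slots;
  uint32_t max_slot;
public:
  explicit SlotTable(uint32_t max) : slots(max + 1), max_slot(max) {}

  Action on_request(uint32_t slot, uint64_t seq, bool please_cache) {
    if (slot > max_slot) return Action::Error;
    OpSlot& s = slots[slot];
    if (seq < s.seq || seq > s.seq + 1) return Action::Error;
    if (seq == s.seq + 1) {
      // New operation. Assumption: the previous one must have finished.
      if (!s.completed) return Action::Error;
      s.seq = seq;
      s.completed = false;
      s.cached.reset();
      s.want_cache = please_cache;
      return Action::Execute;
    }
    // seq == s.seq: the client is resending.
    if (!s.completed) return Action::Drop;  // the reply goes out on completion
    return s.cached ? Action::ResendCached : Action::ResendStatus;
  }

  void on_complete(uint32_t slot, int r, Reply reply) {
    OpSlot& s = slots.at(slot);
    assert(!s.completed);
    s.r = r;
    if (s.want_cache) s.cached = std::make_shared<const Reply>(std::move(reply));
    s.completed = true;
  }

  const OpSlot& peek(uint32_t slot) const { return slots.at(slot); }
};
```

Because each slot holds at most one reply, the memory the OSD commits to a client is bounded by `max_slot` and by whatever size limit it applies to cached replies.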

### Reconnection ###

Currently the `Session` is destroyed on reset and a new one is created
on authorization. In our proposed system the `Session` will not be
destroyed on reset, it will be moved to a structure where it can be
looked up and destroyed after `timeout` since the last message
received.

On connection, the OSD should first look up a `Session` keyed
on the entity name and create one if that fails.
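
A minimal sketch of such a lookup structure is shown below. It assumes a plain `std::map` keyed on the entity name rendered as a string and uses `std::chrono::steady_clock` in place of Ceph's coarse monotonic clock; `SessionTable` and its methods are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

using Clock = std::chrono::steady_clock;  // stand-in for the coarse mono clock

struct Session {
  std::string entity_name;
  Clock::time_point last_contact;
};

class SessionTable {
  std::map<std::string, std::shared_ptr<Session>> sessions;
  Clock::duration timeout;
public:
  explicit SessionTable(Clock::duration t) : timeout(t) {}

  // On connection: reuse the surviving session if there is one,
  // otherwise create a fresh one. Either way, record the contact time.
  std::shared_ptr<Session> lookup_or_create(const std::string& name,
                                            Clock::time_point now) {
    std::shared_ptr<Session>& s = sessions[name];
    if (!s) {
      s = std::make_shared<Session>();
      s->entity_name = name;
    }
    s->last_contact = now;
    return s;
  }

  // Drop sessions that have been silent for longer than the timeout.
  // Returns how many were dropped.
  std::size_t expire(Clock::time_point now) {
    std::size_t dropped = 0;
    for (auto it = sessions.begin(); it != sessions.end();) {
      if (now - it->second->last_contact > timeout) {
        it = sessions.erase(it);
        ++dropped;
      } else {
        ++it;
      }
    }
    return dropped;
  }

  std::size_t size() const { return sessions.size(); }
};
```

In a real OSD `last_contact` would be refreshed on every operation received, not only on connection.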

# Read as a part of Transaction #

We don't have code examples here since most of the interface
changes are obvious. Codes and parameters would be added to
`PGBackend::Transaction` and executing a transaction would have to
return data.

## Motivation ##

-   Mixed reads and writes are an efficiency win, since a client can
    save round trips by batching up operations in a single request.
    Current Ceph does not allow them for reasons which are quoted and
    addressed in the preceding section.
-   Mixed reads and writes are a semantic win. If an `MOSDOp` is
    atomic (it is in current Ceph), read-after-write can often remove
    the need for explicit locking.
-   Transactional reads may seem complicated, but the Erasure Coding
    backend already has to execute complex read transactions to
    reassemble or recover data. We want an asynchronous read capability
    in the Store anyway and there's no reason not to have it be shared
    with our asynchronous write path.
-   While it might seem that separating reads and writes, as we do
    now, allows us to simplify code and rule out edge cases, we would
    like to point out the existence of CLS, which can have problems if
    two method calls occur in the same `MOSDOp`.

## Sketch ##

The main problem with mixed read-write transactions is that replicas
need to write but not read. The key to handling this is dependency
checking. Outside CLS (which will be discussed below) it is very easy
to see whether reads and writes are independent. (Simply go down the
ops and see if their ranges overlap and whether getattrs and setattrs
have keys in common.) Reads coming after overlapping writes depend on
the previous writes. Then:
-   If an op is all reads, simply do all the reads. We don't have
    to get write locks or anything.
-   If an op is all writes, it's no different than a replicated
    operation now.
-   For mixed reads and writes, if the reads aren't dependent on the
    writes, dispatch the writes and do the reads before, after, or
    concurrently with the writes on the primary.  (So long as we
    prevent writes from other transactions from intervening.)
-   Dependent reads are the difficult case. For erasure coding it
    shouldn't make any difference since we'd have to dispatch reads and
    writes to all stripes anyway. For replication, we would want to
    execute the mixed read-write transaction on the local store in
    strict order and dispatch one consisting of only writes to the
    remotes.
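
The dependency check for the non-CLS case might look something like the sketch below. It models byte ranges only (attribute keys are left out), and `Op`, `Plan`, `classify` and `replica_transaction` are hypothetical names. A read that precedes an overlapping write is not counted as dependent here, though it must still observe the pre-write data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Op {
  enum Kind { Read, Write } kind;
  uint64_t off;
  uint64_t len;
};

// Half-open ranges [off, off + len) intersect.
inline bool overlaps(const Op& a, const Op& b) {
  return a.off < b.off + b.len && b.off < a.off + a.len;
}

enum class Plan { ReadOnly, WriteOnly, IndependentMixed, DependentMixed };

// Walk the ops in order; a read is dependent if any earlier write
// touches the same range.
Plan classify(const std::vector<Op>& ops) {
  bool any_read = false, any_write = false, dependent = false;
  for (std::size_t i = 0; i < ops.size(); ++i) {
    if (ops[i].kind == Op::Write) {
      any_write = true;
      continue;
    }
    any_read = true;
    for (std::size_t j = 0; j < i; ++j) {
      if (ops[j].kind == Op::Write && overlaps(ops[j], ops[i]))
        dependent = true;
    }
  }
  if (!any_write) return Plan::ReadOnly;
  if (!any_read) return Plan::WriteOnly;
  return dependent ? Plan::DependentMixed : Plan::IndependentMixed;
}

// What a replica needs to see: the writes only, in their original order.
std::vector<Op> replica_transaction(const std::vector<Op>& ops) {
  std::vector<Op> out;
  for (const Op& op : ops)
    if (op.kind == Op::Write) out.push_back(op);
  return out;
}
```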

# CLS #

## Current Problem ##

The CLS API works by making an ops vector and handing it to
`do_osd_ops`.

```c++
int cls_cxx_getxattr(cls_method_context_t hctx, const char *name,
                     bufferlist *outbl)
{
  ReplicatedPG::OpContext **pctx = (ReplicatedPG::OpContext **)hctx;
  bufferlist name_data;
  vector<OSDOp> nops(1);
  OSDOp& op = nops[0];
  int r;

  op.op.op = CEPH_OSD_OP_GETXATTR;
  op.indata.append(name);
  op.op.xattr.name_len = strlen(name);
  r = (*pctx)->pg->do_osd_ops(*pctx, nops);
  if (r < 0)
    return r;

  outbl->claim(op.outdata);
  return outbl->length();
}

int cls_cxx_setxattr(cls_method_context_t hctx, const char *name,
                     bufferlist *inbl)
{
  ReplicatedPG::OpContext **pctx = (ReplicatedPG::OpContext **)hctx;
  bufferlist name_data;
  vector<OSDOp> nops(1);
  OSDOp& op = nops[0];
  int r;

  op.op.op = CEPH_OSD_OP_SETXATTR;
  op.indata.append(name);
  op.indata.append(*inbl);
  op.op.xattr.name_len = strlen(name);
  op.op.xattr.value_len = inbl->length();
  r = (*pctx)->pg->do_osd_ops(*pctx, nops);

  return r;
}
```

The `do_osd_ops` function performs reads inline, synchronously, right
then and there for replicated pools. (Erasure coded pools are more
limited.) Writes are batched up and added to the transaction
associated with the current `OpContext`.

This is bad. If one has a CLS method that performs a read-modify-write
and one calls it twice in the same `MOSDOp`, it becomes a
read-modify-read-modify-write-write which may produce incorrect
results.
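
The hazard can be reproduced with a toy model in which reads see the committed store immediately while writes are deferred to commit time. Everything here is a hypothetical simplification (an integer-valued attribute map, a made-up `cls_increment` method), not the real CLS API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy object: attribute name -> integer value.
using Store = std::map<std::string, int>;

struct PendingWrite {
  std::string name;
  int value;
};

struct OpContext {
  const Store& store;              // reads hit the committed state right away
  std::vector<PendingWrite> op_t;  // writes are batched until commit
};

// A read-modify-write "method": read the counter, queue counter + 1.
void cls_increment(OpContext& ctx, const std::string& name) {
  auto it = ctx.store.find(name);
  int cur = (it == ctx.store.end()) ? 0 : it->second;
  ctx.op_t.push_back({name, cur + 1});
}

void commit(Store& store, const OpContext& ctx) {
  for (const PendingWrite& w : ctx.op_t) store[w.name] = w.value;
}

// Two increments in one operation: both reads happen before either write.
int run_two_increments() {
  Store store;
  store["count"] = 0;
  OpContext ctx{store, {}};
  cls_increment(ctx, "count");
  cls_increment(ctx, "count");
  commit(store, ctx);
  return store["count"];
}
```

In this model `run_two_increments()` yields 1 where a caller would expect 2, because the second call never sees the first call's queued write.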

## Desiderata ##

-   CLS operations should be composable. We should be able to have many
    of them in a single operation.
-   They should remain transactional. If a CLS operation does some
    reads and hits an error, it stops and nothing is written to the
    store. We should not allow situations where a CLS method can write
    a partial result to the store then error.
-   They should be capable. We should not put too many restrictions on
    what an operation is allowed to do. It should be possible to run
    them on Erasure Coded Pools once ECOverwrite is in place. (At
    least some subset of them).
-   They should be consistent. A CLS operation should be able to call
    rand or generate a UUID without each replica holding a different
    value. (This rules out solutions like calling the method on each
    replica.)
-   They should be efficient and optimizable.
-   They should work in an asynchronous framework.

There are several ways we could change their implementation to address
these.

## Futures ##

This is an attractive way to think about CLS. It allows things to
proceed asynchronously and would solve the RMRMWW problem. One would
simply make every I/O operation in the CLS API a call returning a
future and write each method in continuation passing style. Executing
the transaction in the primary OSD (on a replicated pool) would create
a write-only transaction that could then be sent to replicas. (Having
the execution of a CLS method also compile a write-only transaction is
a property of any composable design.)
-   Tracking dependencies before the operation is executed would be
    problematic. There would be no way to know whether later reads
    overlapped with previous writes before doing them. This could lead
    to an unbounded obligation on the OSD to maintain state to
    evaluate OSDOp, including potentially large writes, before
    actually committing in order for CLS methods to remain
    transactional.

Futures are, on their own, insufficient to provide everything we need
from CLS, largely because they are opaque to the OSD. They could be
combined with…

## Pre-declaration ##

We could remove some of the generality. A simple way of doing so would
be to have methods declare, as a part of their signature, everything
that they may ever read or write, with the expectation that methods
will name the fewest resources required. This doesn't mean that every
method will always write to and read from everything it mentions,
merely that we have a known bound of the maximum it will ever use.

This makes analysis easier for the OSD, and in the composition case,
it could go in two passes. In the first, it would execute CLS calls
and pre-stage results and in the second it would pass its compiled
write transaction into the store.

This is the most attractive solution, but depends on pre-declaration
being done well on the resources used in pre-staging.

One could make things easier by being even more restrictive and
imposing ordering:
1.  The method declares in advance all read operations it might ever
    perform.
2.  The method declares in advance all write operations it might ever
    perform.
3.  The method examines the parameters passed by the client and
    indicates which subset of the named inputs and outputs it will
    use.
4.  The method performs its read operations and denotes exactly which
    output operations it will perform. (not the data to be output, but
    ranges and names.)
5.  The method performs write operations.

The most restrictive form of this would operate in two phases. First,
the CLS method would be presented with its parameters and all of the
things it plans to read or write (objects with ranges and attribute
keys.) In the second it would be called with the contents of all the
reads it requested and supply the data for all the writes it requested.

This would obviate the need for futures or other asynchronous I/O,
and make evaluation very easy. This approach would disallow some
operations, like indirecting through an attribute key to read another,
but is very appealing.
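
As a rough illustration of the most restrictive, two-phase form: the method first declares what it will read and write, the OSD gathers the reads, and the method is then called once with the read contents and returns the data to write. All names below (`Method`, `Declaration`, `run`, `make_copy_attr`) are hypothetical, and the store is reduced to a string-keyed map.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>

using Key = std::string;
using Values = std::map<Key, std::string>;

struct Declaration {
  std::set<Key> reads;
  std::set<Key> writes;
};

struct Method {
  // Phase one: given the client's parameters, name everything to be touched.
  std::function<Declaration(const std::string& params)> declare;
  // Phase two: given the contents of the declared reads, produce the writes.
  std::function<Values(const std::string& params, const Values& reads)> compute;
};

// Runs one method against the store. Returns false, leaving the store
// untouched, if the method tries to write something it did not declare.
bool run(const Method& m, const std::string& params, Values& store) {
  const Declaration d = m.declare(params);
  Values reads;
  for (const Key& k : d.reads) {
    auto it = store.find(k);
    if (it != store.end()) reads[k] = it->second;
  }
  const Values writes = m.compute(params, reads);
  for (const auto& kv : writes)
    if (d.writes.count(kv.first) == 0) return false;
  for (const auto& kv : writes) store[kv.first] = kv.second;
  return true;
}

// Example method: copy attribute "src" to attribute "dst".
Method make_copy_attr() {
  Method m;
  m.declare = [](const std::string&) {
    Declaration d;
    d.reads.insert("src");
    d.writes.insert("dst");
    return d;
  };
  m.compute = [](const std::string&, const Values& reads) {
    Values out;
    auto it = reads.find("src");
    if (it != reads.end()) out["dst"] = it->second;
    return out;
  };
  return m;
}
```

Since the method never performs I/O itself, the OSD can check declarations for overlap between composed calls before running anything, and nothing reaches the store unless every check passes.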

## Be Transactional ##

Our transactions are pre-checked and must succeed. If we want the most
expressive version of CLS consistent with our other goals, then we
should add commit and rollback. EC Overwrite will already require some
form of commit and rollback, so it's not beyond the realm of thought.

It could also be a foundation for some future multi-object-transaction
supporting backend.

This idea might have appeal on its own, but the concerns of CLS are
not sufficient to motivate it.

## Domain Specific Language ##

One could make a domain specific language, based on something simple,
that the OSD can execute to perform CLS methods. The OSD could then
analyze each method to see what I/O operations a method calls and try
to track them.
-   Dependency tracking for compilers is a major area of research. It
    would be a whole lot of fun, but as a short term solution it is
    not really practical.
-   We still wouldn't be able to rule out problems in the general
    case.

This approach would be interesting as a long term academic research
project, but is not suitable for a short-range improvement.

# Flexible Placement #

This is a large topic which should be discussed on its own, but it
motivates the interface designs below, so we shall briefly mention why
it's interesting.

CRUSH/PG is a fine placement system for several workloads, but it has
two well-known limitations.

## Motivation ##

-   Data distribution can be much less uniform than one might like,
    giving uneven use of disks. This has caused some Ceph developers
    to experiment with Monte Carlo based placement algorithms.
-   Data distribution can be much more uniform than one would
    like. This is the fundamental cause of Ceph's slow sequential read
    performance. More generally, unrelated workloads contend
    with each other due to a lack of affinity for related data. The effects are
    especially pronounced on spinning disk (due to seek times), but
    still exist on Flash (due to bus/network contention.)  This is a
    tension between competing goods. CRUSH gains wide dispersion and
    uniformity to defend against correlated failures but this imposes
    a tradeoff.

## Goal ##

Ceph should support placement methods other than CRUSH/PG. Currently,
the OSD dispatches operations based on placement group ID, which will
need to be varied.

We also need some way to get new types of functions into the cluster.

## Proposal ##

Our proposal is, in a way, CRUSH taken to its logical
conclusion. Instead of distributing CRUSH rules, we propose to
distribute general computable functions from (oid, volume/dataset) pairs to
sequences of OSDs with their supporting data structures.  One of our
ongoing research projects has been an in-process executor for these
functions based on Google's NaCl. The benefits are:
-   Administrators can fine-tune placement functions to fit their
    workloads well.
-   They can also experiment easily without having to recompile all of
    Ceph and make heavy architectural changes.
-   Entirely new placement strategies can be deployed without having
    to upgrade every machine in the cluster. Or any machine in the
    cluster, once they've been upgraded to a Flexible Placement
    capable version.
-   Possibilities for annealing and machine learning to gradually
    adapt placement in response to load data become available
-   NaCl builds on LLVM which has a rich set of tools for optimizations
    like partial evaluation.
-   NaCl is fast.
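
To make the shape of the proposal concrete, here is a sketch of a placement function as an ordinary callable from an (oid, volume) pair to a sequence of OSDs. It says nothing about NaCl or about how functions would be distributed; `PlacementFn`, `make_spread` and `make_affinity` are hypothetical, and `std::hash` is used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using OsdId = int;
using PlacementFn =
    std::function<std::vector<OsdId>(const std::string& oid,
                                     const std::string& volume)>;

// Helper: `replicas` consecutive OSDs starting at hash(key) mod num_osds.
inline std::vector<OsdId> pick(const std::string& key, std::size_t num_osds,
                               std::size_t replicas) {
  std::size_t start = std::hash<std::string>{}(key) % num_osds;
  std::vector<OsdId> out;
  for (std::size_t i = 0; i < replicas; ++i)
    out.push_back(static_cast<OsdId>((start + i) % num_osds));
  return out;
}

// Wide dispersion: every object hashes independently.
PlacementFn make_spread(std::size_t num_osds, std::size_t replicas) {
  assert(replicas <= num_osds);
  return [=](const std::string& oid, const std::string& volume) {
    return pick(volume + "/" + oid, num_osds, replicas);
  };
}

// Affinity: all objects of a volume land on the same OSDs.
PlacementFn make_affinity(std::size_t num_osds, std::size_t replicas) {
  assert(replicas <= num_osds);
  return [=](const std::string&, const std::string& volume) {
    return pick(volume, num_osds, replicas);
  };
}
```

The two factories sit at opposite ends of the dispersion/affinity tension described under Motivation; a tuned function would fall somewhere between them.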

# Flexible Semantics #

Another motivating example. Originally, Ceph did replication and only
replication under a very specific consistency model. There has been
desire for more flexibility.
-   Erasure Coding. It still follows the Ceph consistency model
    (though leaves out many operations) but is very different in
    back-end dispatch, enough so that it inspired a major rewrite of
    the OSD's bottom half.
-   Append-only immutable objects have been discussed.
-   Many people have asked for relaxed consistency to improve
    performance. This would not be suitable for all workloads, but people
    have repeatedly asked for the ability to set up low-latency,
    relaxed-consistency volumes that still provide Ceph's ability to
    easily use new storage and scale well.
-   Transactional storage. As mentioned above, cross-object
    transactional semantics are a thing people may have desired.

# Interfaces #

Right now our class hierarchy is a bit of a mess. Eventually we'll do
something about `PG` and `ReplicatedPG`, refactor, support
asynchronous I/O, reduce lock contention, support in core affinity,
and build Jerusalem here in England's green and pleasant land.

While we're stringing up our bows of burning gold, we should support
non-PG based placement and flexible semantics. Right now, parts of the
PG and the OSD (since the OSD manages the collection of PGs, spins
them up, and manages thread pools shared by sets of PGs) are
intertwined. Thus, we need to abstract out both pieces.

As we also want to support having multiple "logical" OSDs running in a
single `ceph-osd` process, this would be a natural time to add that
capability.

Both these are sketches and should be considered a work in progress.

## `DataSetInterface` ##

Here is a sketch of what a flexible abstraction based on PG could look
like, at least parts of one. Not being informed about Scrub,
Recovery, or Cache Tiering, having only focused on the object
operation path, we won't include those details here.

We also leave out functions called from the PG itself or other objects
invoked from down the stack.

```c++
class DataSetInterface {
protected:
  LogicalOSD& losd; // LogicalOSD is a means to have different
                    // stores/semantics run in the same process.

  MapPartRef curmap; // Subset of map relevant to this DSI
public:
  // The OSD (things Up the Stack, generally) should not call 'lock'
  // on us. If we have locking of some sort things down the stack that
  // we have some relationship with (friend or whatever) could lock or
  // unlock us, but that should not be baked in as part of the interface.

  // Things like the info struct and details about loading the Place
  // wouldn't actually be here. As there is an intimate relation
  // between the LogicalOSD and an implementation of DataSetInterface (it
  // holds all those loaded in memory and controls dispatch), they
  // would not need to be part of the generic interface.

  const coll_t coll; // The subdivision of the Store we control

  // In the PG case we always know we're the primary or not for
  // anything within the same pgid. That is not expected to be the
  // case generally.
  virtual bool is_primary(const OpRequest&) = 0;
  // No 'is_replica' since 'replica' may not be applicable
  // generally. It's a bit off even in the erasure coded case.
  virtual bool is_acting(const OpRequest&) = 0;
  virtual bool is_inactive() = 0;

 public:
  // No identifier. The descendent will take that.
  DataSetInterface(LogicalOSD& o, OSDMapRef curmap);
  virtual ~DataSetInterface();

  DataSetInterface(const DataSetInterface&) = delete;
  DataSetInterface& operator =(const DataSetInterface&) = delete;
  DataSetInterface(DataSetInterface&&) = delete;
  DataSetInterface& operator =(DataSetInterface&&) = delete;

  virtual void on_removal(ObjectStore::Transaction *t) = 0;

  // Yes, there's no 'queue' and no 'do_op' or any of
  // that. This is intentional. There's no dequeue or do_op because
  // those functions are either called only by the PG currently OR
  // they're called in OSD functions called by the PG as part of the
  // thread switch. They should not be part of the public interface.

  // There's no queue because we can either put queue here or we can
  // put queue in LogicalOSD. (We could do both, but that seems bad to
  // me.) If there is some combination of locking and checking that
  // must be done before queueing an operation, it seems that it's
  // better to do it in LogicalOSD so that it doesn't leak out and
  // become part of the abstraction for other implementations.
};
```

## `LogicalOSD` ##

The OSD class itself (representing the single OSD process) should have
a map (*perhaps* a Boost.Intrusive.Set?) mapping OSD IDs to
`LogicalOSD` instances.

```c++
class LogicalOSD {
  OSD& osd;
  ObjectStore& store;

  // Look up the DataSetInterface instance appropriate to the given
  // OpRequest.
  virtual future<DataSetInterface,int> get_place_for(const OpRequest&) = 0;

  // Every logical OSD will have its own watchers as well as slot
  // cache. Someone familiar with flow control should check this
  // idea. Since LogicalOSDs will, ideally, share messengers we might
  // want them to share the same slot cache. In that case we should
  // just re-dimension watchers within Session
  SessionRef session_for(const entity_name_t& name);

  void queue(DataSetInterfaceRef&& pi, OpRequestRef&& to_queue);
  void queue_front(DataSetInterfaceRef&& pi, OpRequestRef&& to_queue);

  // Dequeue and the like are currently called in the PG itself and so
  // have no place in the interface presented to the OSD.

  void pause();
  void resume();
  void drain();
};
```

## Library ##

Both these interfaces are quite thin and intentionally so. Scrubbing
and recovery have not been addressed at all, as mentioned, so those
parts will be expanded.  Asynchrony should allow us simpler interfaces
since some complexity of requeueing will be handled by futures and
continuations.

We obviously do not want to rewrite all our existing code. Instead
most of the existing work on `PG` and `ReplicatedPG` should be
refactored into a templated library from which implementations of
`LogicalOSD` and `DataSetInterface` can be constructed.

# Map Partitioning #

There are two huge problems with scalability in Ceph.
1.  The OSDMap knows too many things
2.  A single monitor manages all updates of everything and replicates them to
    other monitors.

## Too Big to Not Fail ##

The monitor map and MDS maps are fine. Each holds data needed to
locate servers and that's it. It would be very hard to put enough data
in them to cause problems. The OSD map however contains a trove of data that
must be updated serially in Paxos and propagated to every OSD,
monitor, MDS, and client in the cluster.

Pools are a notorious example. We can't create as many pools as users
would like. Pools are heavyweight, and while they depend on other
items in the OSD map (like erasure code profiles), it would be nice if
we could divide them between several monitor clusters, each of which would
hold a subset of pools. We would need to make sure that clients had up
to date versions of whatever pools they are using along with the
status of the OSDs they're speaking to, but that's not
impossible. Likewise, we should split placement rules out of the OSD
map, especially once we get into larger numbers of potentially larger
Flexible Placement style functions.

Nodes should then only need to subscribe to the set of pools and
placement functions they need to access their data. Changes like these
should allow users to create the number of pools they want without
causing the cluster difficulty.

### Consistency ###

Partitioning makes consistency harder. A simple remedy might be to
stop referring to data by name or integer. An erasure code profile
should be specified by UUID and version. So should pools and placement
functions. When sending a request to the OSD, a client should send the
versions of the pool, the ruleset, and the OSDMap it used and the OSD
should check that all three are current.
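
A sketch of that check, under the assumption that each of the three items is identified by a UUID plus a version and that "current" means an exact match with what the OSD holds. `Versioned`, `OpVersions` and `check_versions` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

struct Versioned {
  std::string uuid;
  uint64_t version;
};

inline bool operator==(const Versioned& a, const Versioned& b) {
  return a.uuid == b.uuid && a.version == b.version;
}

// What a client would attach to a request, and what the OSD holds.
struct OpVersions {
  Versioned pool;
  Versioned ruleset;
  Versioned osdmap;
};

enum class Check { Ok, PoolMismatch, RulesetMismatch, OsdMapMismatch };

// Report the first item on which client and OSD disagree.
Check check_versions(const OpVersions& client, const OpVersions& osd) {
  if (!(client.pool == osd.pool)) return Check::PoolMismatch;
  if (!(client.ruleset == osd.ruleset)) return Check::RulesetMismatch;
  if (!(client.osdmap == osd.osdmap)) return Check::OsdMapMismatch;
  return Check::Ok;
}
```

Reporting which item mismatched lets the stale side refetch only that item, which is the point of partitioning the map in the first place.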

## The OSD Set ##

The complicating case here is the OSD status set.  Running this
through a single Paxos limits the number of OSDs that can coexist in a
cluster.  We ought to split the set of OSDs between multiple masters to
distribute the load. Each 'Up' or 'Down' event is independent of
others, so all we require is that events get propagated into the
correct OSDs and primaries and followers act as they're supposed to.

Versioning is a bigger problem here. We might have all masters
increment their version when one increments its version if that could
be managed without inefficiency. We might send a compound version with
`MOSDOp`s, but combining that with the compound version above might be
unwieldy. (Feedback on this issue would be greatly appreciated.)

### Subscription ###

For a large number of OSDs, it would be nice if not everyone were
notified of all state changes.

For a pool whose placement rule spans only a subset of all OSDs,
clients using that pool should be able to subscribe to a subset of the
OSD set corresponding to that pool. This should be fairly easy so long
as the subset is explicit.

In the case of pools not providing an explicit subset, a monitor (or
perhaps a proxy in front of a set of monitors) could look at common
patterns of subscription requests and merge those with significant
overlap together, so as to give clients a subset without being
destroyed by the irresistible force of combinatorial explosion.

# Notes #

These are notes taken when reviewing the code and thinking out
ideas. You don't have to read them, but they are provided as a
supplement in case you wanted to know what we were thinking and why.

## ShardedOpWQ ##

-   What is the purpose of `sdata_op_ordering_lock`? A shard is not a
    PG, so why do things need to be ordered within shards as well as
    within PGs?
-   `sdata_lock` pairs up with the condition variable

## OSD Upper Half ##

### Regular Dispatch ###

-   Does not overlap with `fast_dispatch`. Operations in
    `ms_can_fast_dispatch` are not handled in `_dispatch` and vice versa.
-   Lock the entire OSD
-   If another dispatch is executing, go to sleep and wait for it to
    finish. What the heck?
-   Do Waiters
    * Waiters are a list of `OpRequestRef`s called `finished` for some
      reason
    * Whenever we activate an `OSDMap` the requests waiting for the
      map get put onto 'finished'
-   Call `_dispatch`
-   Do some more waiters
-   Wake up other dispatch threads
-   Unlock the entire OSD

#### `_dispatch`? ####

A giant case statement that does a bunch of things.

In the case of `OSDOp`, if we have an `OSDMap`, create an `OpRequest` and
pass it to `dispatch_op`. This is for things like PG commands, not
actual object operations.

#### `dispatch_op` ####

Another giant case statement. 

### Fast Dispatch ###

#### `ms_fast_preprocess` ####

Update the map epoch if an OSD sends us an OSDMap.

#### `ms_fast_dispatch` ####

-   Make an `OpRequest`
-   A bit weird and convoluted, it looks like we use the 'op waiting
    for map' stuff to queue up an op on a reserved map and remove the
    reservation preventing it from running before we return.
-   Specifically we mark the op as waiting for its PG in the `Session`
    and then mark the `Session` as waiting for the new map.
-   Ultimately things end up in `dispatch_op_fast`

#### `dispatch_op_fast` ####

Shovels operations into type-specific calls like…

#### `handle_op` ####

-   Set up map share (if needed)
-   Calculate the True PGID and Pool (sanity check against the client?)
-   Either get the pointer to the PG (a base class) or, if it hasn't
    been loaded in, queue the session to wait for it
-   If we have the PG, `enqueue_op` (which just calls `PG::queue_op`)

## OSD Lower Half (Currently PG) ##

`ReplicatedPG` and `PG` are separate for historical reasons and actual
differentiation occurs in choice of backend according to Word of Sam.

PGs with different consistency properties are explicitly a goal
now. The idea of a `PGInterface` has been floated to facilitate their
creation and `ReplicatedPG` would become a child of that.

### `PG::queue_op` ###

-   Delay if other people are waiting for maps (to preserve the PG Ordering)
-   Enqueue in `op_wq` (owned by the OSD)
-   (Why call into the PG then? Just to enforce the map ordering?)
-   The work queue gathers operations which, during `_process` are later
    reassembled into a list of work to be done.
-   `_process` is called by a worker thread in the thread pool, so the
    call to `dequeue_op` is in worker thread. Since it's sharded, we
    get multiple groups of threads each serving some subset of PGs.
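
A rough sketch of the sharding idea, assuming the shard is picked by taking the PG id modulo the shard count. `ShardedQueue`, `QueuedOp`, and `shard_of` are made-up names, and the locks, priorities, and worker threads of the real `ShardedOpWQ` are omitted.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

struct QueuedOp {
  std::uint32_t pg;  // hypothetical: PG identified by a bare number
  std::string name;
};

class ShardedQueue {
  std::vector<std::deque<QueuedOp>> shards_;

 public:
  explicit ShardedQueue(std::size_t n) : shards_(n) { assert(n > 0); }

  // Every op for a given PG lands in the same shard.
  std::size_t shard_of(std::uint32_t pg) const { return pg % shards_.size(); }

  void enqueue(QueuedOp op) {
    const std::size_t s = shard_of(op.pg);
    shards_[s].push_back(std::move(op));
  }

  // What a worker bound to `shard` would do: take the oldest op, if any.
  bool dequeue(std::size_t shard, QueuedOp& out) {
    std::deque<QueuedOp>& q = shards_.at(shard);
    if (q.empty()) return false;
    out = std::move(q.front());
    q.pop_front();
    return true;
  }
};
```

Because a PG always maps to one shard and each shard is FIFO, ordering within a PG falls out of ordering within the shard. The cost is that unrelated PGs sharing a shard are ordered against each other too.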

### `OSD::dequeue_op` ###

-   After a bit of fiddling about sharing maps, call `PG::do_op`

### `PG::do_op` ###

-   Sam says he plans to rewrite this to allow read asynchrony
-   We want to see reads and writes share the same transaction
    structure and similar semantics.
-   We also want to allow reads and writes in the same operation and
    to use a session mechanism to allow that.
-   We'll need transaction transforms to, for example, filter out
    reads before sending an operation to a replicating OSD. This
    shouldn't be too hard, since the output of read operations can't
    be the input for write operations. (Except in CLS?)
-   `do_op` is a virtual function, but the only implementation is in
    ReplicatedPG.
-   Here looks to be where we apply ordering to Writes
-   `execute_ctx` actually performs the operations after `do_op` has set
    everything up
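
The simplest transaction transform mentioned above, dropping reads before shipping to a replica, might look like this. `TxOp`, `OpKind`, `Tx`, and `writes_only` are illustrative names and not the existing `PGBackend::Transaction` interface; it also assumes no write consumes the output of a read.

```c++
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

enum class OpKind { Read, Write };

struct TxOp {
  OpKind kind;
  std::string what;  // free-form description, for illustration only
};

using Tx = std::vector<TxOp>;

// Build the transaction a replica would receive: same writes, same
// relative order, no reads.
Tx writes_only(const Tx& full) {
  Tx out;
  std::copy_if(full.begin(), full.end(), std::back_inserter(out),
               [](const TxOp& op) { return op.kind == OpKind::Write; });
  assert(out.size() <= full.size());
  return out;
}
```

The primary would still execute the full mixed transaction locally; only the filtered copy goes over the wire.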

### `execute_ctx` ###

-   May be called multiple times on the same `OpContext`
-   In the case of clone operations (that's the only thing that takes
    `src_obc`?), get a read-lock for the object context
-   It's called `ondisk` but I'm not sure why; it doesn't look like
    they get serialized
-   Then we have a brief detour into `prepare_transaction`
-   Here's the read-write restriction. ReplicatedPG.cc:2975. Later we can
    create a better session abstraction to fix that.
-   For reads
    *   `do_osd_op_effects`!
    *   If all our reads were synchronous (or there were none)
        `complete_read_ctx`, which creates and sends the reply
    *   Otherwise, `start_async_reads`, which passes the pending reads off
        to `objects_read_async`
    *   Once the backend completes, we go to `finish_read`, which calls
        `complete_read_ctx`
-   Trim the PG Log
-   Hey, cool, there's a lambda! Register an `onack` closure that sends a reply
-   And `oncommit`. And `onsuccess`. And `finish`.
-   Package up the `OpContext` and its transaction and whatnot into a
    `RepOp`. This is where all the mutations get done.
-   Call `issue_repop`
-   Call `eval_repop`
-   Adam really wishes we would use `boost::intrusive_ptr` everywhere
    and stop using explicit gets and puts.

### `prepare_transaction` ###

-   `do_osd_ops`!
-   If we're not full, `finish_ctx`.

### `do_osd_ops` ###

-   Loop over the ops in a gigantic case statement
-   If we hit any modification ops set the `user_modify` flag. This is
    used to update the object version as part of the transaction
-   On EC pools, do reads asynchronously, pushing them onto a list of
    reads to complete.
-   Otherwise do the reads synchronously
-   CLS calls can be tricky since they read or write depending on the
    method invoked
-   It looks like operations performed by CLS are done by calling each
    operation individually with `do_osd_ops` with reads being done
    immediately and writes being queued up as part of the transaction
-   Making the CLS API futures-based interface may be a good thing to do.
-   Cache ops like flushing seem to be about shovelling triggers to
    perform actions into the `onack`/`oncomplete` lists.
-   For write operations, stuff them into the Transaction
-   In the case of CLS operations which do both reads and writes
    (which some of them do), it appears that putting two CLS operations
    in the same OSDOp might lead to weird results since all the reads
    will happen then all the writes.
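
The weird result is easy to reproduce in a toy model where reads hit the committed store immediately and writes are queued until commit. Everything here (`Store`, `OpContext`, `read_now`, `cls_increment`, `commit`) is a hypothetical stand-in and not the real CLS API.

```c++
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Store = std::map<std::string, int>;

struct OpContext {
  const Store& store;                                // committed state
  std::vector<std::pair<std::string, int>> pending;  // queued writes
};

// Reads are served immediately from committed state and never see `pending`.
int read_now(const OpContext& ctx, const std::string& key) {
  auto it = ctx.store.find(key);
  return it == ctx.store.end() ? 0 : it->second;
}

// A read-modify-write "method": read now, queue the write for later.
void cls_increment(OpContext& ctx, const std::string& key) {
  ctx.pending.emplace_back(key, read_now(ctx, key) + 1);
}

// Apply all queued writes in order once the op has finished.
void commit(Store& store, const OpContext& ctx) {
  assert(&store == &ctx.store);
  for (const auto& w : ctx.pending) store[w.first] = w.second;
}
```

Calling `cls_increment` twice in one context and then committing leaves the counter at 1 instead of 2, since both reads ran before either write was applied.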

### `finish_ctx` ###

-   Fiddle with object state and logs to update snapshot foo and to make
    sure the object exists in the form we need it
-   Update user version if we modified the object
-   Save the updated `object_info_t`
-   Append the updated object info to the `PGLog`
-   Apply context stats

### `do_osd_op_effects` ###

-   Add watches if we need to add watches
-   If there's notifies, notify the watchers
-   Why do we ack notifies?

### `issue_repop` ###

-   Acquire locks (I'm still not clear why they're called `ondisk`. Is
    it a lock acquired to use the store and thus it locks the on-disk
    representation?)
-   Apply built up attributes (likely versions and things that had been
    stuck in the PGLog before.)
-   Submit transaction to the PG Backend, which is where it gets
    divided up for Erasure Coding or sent out for replication. I'll
    count that as Bottom End for the moment alongside the Store.
    Changes to the backend will be for new consistency models.

    We might be able to get a separation of concerns by varying what
    is now ReplicatedPG to support different 'gridding' of objects on
    the OSD and rejigger things so the consistency model is purely a
    property of the backend. That's appealing from a maintenance
    perspective, but breaks down if we want things like explicitly
    marked transactions across multiple for some volumes while not
    paying for them on others. It might not be workable in the general
    case.
-   That's also where local application takes place.

### `eval_repop` ###

-   This function just sends notifications and cleans up when we finish.
-   Its name is not very appropriate for what it does.
-   If we're already done, return.
    *   This isn't bad, but it's specifically necessary because `eval_repop`
        gets called from several places including the handlers for our
        subservient OSDs completing an operation.
-   If everyone's ack'd, fire off our ack handlers. If everyone's
    completed, fire off our completion handlers.
-   Notify anyone waiting for the version we've committed…
-   And for those waiting on the one we've applied
-   If we've done everything, update usage stats
    *   Fire off `on_success` callbacks
    *   Remove ourselves
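
Since `eval_repop` is called from several places, the property that matters is that calling it again is harmless. A stripped-down sketch, assuming replicas are identified by an `int` and that a commit implies an ack; `RepOp`, `fire`, and `replica_reply` are illustrative names and not the real `RepGather` machinery.

```c++
#include <cassert>
#include <functional>
#include <set>
#include <utility>
#include <vector>

struct RepOp {
  std::set<int> waiting_for_ack;
  std::set<int> waiting_for_commit;
  std::vector<std::function<void()>> on_ack, on_commit, on_success;
  bool acked = false, committed = false, done = false;
};

// Run each callback exactly once, emptying the list first.
static void fire(std::vector<std::function<void()>>& cbs) {
  std::vector<std::function<void()>> run = std::move(cbs);
  cbs.clear();
  for (auto& cb : run) cb();
}

void eval_repop(RepOp& r) {
  if (r.done) return;  // safe to call from every completion path
  if (!r.acked && r.waiting_for_ack.empty()) {
    r.acked = true;
    fire(r.on_ack);
  }
  if (!r.committed && r.waiting_for_commit.empty()) {
    r.committed = true;
    fire(r.on_commit);
  }
  if (r.acked && r.committed) {
    r.done = true;
    fire(r.on_success);
  }
  assert(!r.done || (r.acked && r.committed));
}

// Handler for a reply from a subservient OSD.
void replica_reply(RepOp& r, int osd, bool commit) {
  r.waiting_for_ack.erase(osd);
  if (commit) r.waiting_for_commit.erase(osd);
  eval_repop(r);
}
```

Duplicate replies and repeated evaluation change nothing once the flags are set, which is why the early return is needed rather than merely tolerated.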

## Flex Points ##

### PlacementGroup/FlexiblePlacement/OtherConsistencyStrategy ###

-   Fast Dispatch currently shoves requests into a PG.
-   `handle_op` calculates a pgid and actually gets the pointer to or
    queues the session to wait on the associated PG
-   If we implement `queue_op` in FlexiblePlacement we can do whatever we
    want with it. We can ignore the WorkQueue.
-   Much of the code in `ReplicatedPG` is useful even with other
    semantic models than PG-ordered replication
-   We might want to make `ReplicatedPG` a template and
    supply the `PG` specific parts as a class instantiation. Then we
    could create more classes for other partition/dispatch models.
-   We will want a consistency/semantic variation orthogonal to the
    partition/dispatch model.
    * In this divide, splitting objects into PGs, where all
      operations are dispatched into the PG for whatever object they
      affect, would be partition/gridding
    * Whereas the total ordering on PG operations and constraints on
      when a request blocks versus being served are the
      consistency/semantics
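
The template idea, with partitioning and consistency as two independent parameters, could be shaped roughly like this. All of the names (`DataSet`, `PgPartition`, `SinglePartition`, `TotalOrder`, `Relaxed`) are invented for the sketch, and the "consistency" policy is reduced to a single flag to keep it short.

```c++
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Partition/dispatch policies: which queue does an object's op go to?
struct PgPartition {
  std::size_t pg_num;
  std::size_t target(const std::string& oid) const {
    assert(pg_num > 0);
    return std::hash<std::string>{}(oid) % pg_num;
  }
};

struct SinglePartition {
  std::size_t target(const std::string&) const { return 0; }
};

// Consistency/semantics policies: when does a request have to block?
struct TotalOrder {
  static constexpr bool reads_wait_for_writes = true;
};

struct Relaxed {
  static constexpr bool reads_wait_for_writes = false;
};

// The shared code is written once against both policies.
template <typename Partition, typename Consistency>
class DataSet {
  Partition part_;

 public:
  explicit DataSet(Partition p) : part_(p) {}

  std::size_t queue_for(const std::string& oid) const {
    return part_.target(oid);
  }

  bool read_must_block(bool write_in_flight) const {
    return Consistency::reads_wait_for_writes && write_in_flight;
  }
};
```

`DataSet<PgPartition, TotalOrder>` would correspond to something like today's behavior, and swapping either argument changes one axis without touching the other.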

### Allocation/Locking/Dispatch ###

-   `OpRequest` (currently allocated in `do_op`) and other structures
    might be allocated at various points. In our earlier prototype we
    allocated `OpRequest` and another structure alongside the `MOSDOp` and
    reused `MOSDOp`s rather than deallocating them to cut down on
    allocator use in the fast path.

    That might fight with also promising designs using core-affine
    memory management, unless we can determine core affinity quickly
    before allocating the message. (Maybe peeking into the undecoded
    bytes?)
-   Lock freedom should be orthogonal to flexible placement. There may
    be situations where we want lockful systems in flexible placement
    (since flexible placement can have a variety of sync behaviors),
    and we know that Sam and others are interested in pursuing
    lock-free designs in PG-placement.
-   In a lock-free design, if PGs are core-affine,
    `enqueue_op` could just submit a message to a core without locking
    or some of the thread/worker complexity.
-   For Volumes, where the volume itself may be partitioned across cores
    `enqueue_op` would have to look at the object name to find its target.
-   Thus, we would want to pull that logic into a separate function
    giving our dispatch target.
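
For the message reuse mentioned in the first item, a free list is the obvious shape. This is only a sketch of the idea and not the prototype's code; `Recycler`, `Message`, `reset`, and `fresh_allocations` are hypothetical, and it ignores threads and core affinity entirely.

```c++
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical message type: reset() drops contents but keeps capacity.
struct Message {
  std::vector<char> payload;
  void reset() { payload.clear(); }
};

template <typename T>
class Recycler {
  std::vector<std::unique_ptr<T>> free_;
  std::size_t fresh_ = 0;

 public:
  // Hand out a recycled object if one is available, else allocate.
  std::unique_ptr<T> get() {
    if (free_.empty()) {
      ++fresh_;
      return std::unique_ptr<T>(new T());
    }
    std::unique_ptr<T> p = std::move(free_.back());
    free_.pop_back();
    return p;
  }

  // Return an object to the pool instead of freeing it.
  void put(std::unique_ptr<T> p) {
    assert(p);
    p->reset();
    free_.push_back(std::move(p));
  }

  std::size_t fresh_allocations() const { return fresh_; }
};
```

A per-core instance of something like this would avoid locking, which is where the tension with deciding core affinity before allocation comes from.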

### Read-Write Symmetry ###

-   Thankfully, `init_op_flags` is happy to set both read and write
-   CLS in particular falls afoul of this. Futures might be the best
    way to deal with it.

### Things we know we had to do anyway from previous work ###

-   Use `std::map` less as a parameter/return type, same for `std::set`
-   Objecter improvements
    *   Less allocation, change data structures. A dual to some of the
        work we want to do to make the EC interface less memory
        intensive.
    *   If we have zero copy there should be a way to materialize that
        at the level of the client.
-   See about bootstrapping client-side EC from EC overwrite
-   Librados4 should be more like Objecter than it is like librados3

## Sam and `do_op` (♪ Doo-Wop? ♪) ##

### Discussion ###

Notes taken during a BlueJeans call between Adam Emerson and Sam
Just. (Sorry for any mistakes, recording a conversation while having
it is tricky.)

-   We should never have to block for I/O
-   It's not `do_op` per se, though we are rewriting that to put it into a
    continuation passing style with trampolines
-   Various bits should be allowed to block, but whether they do or
    don't should not affect the caller's code-flow.
-   Once we've got to that point, everything after is easier
-   We have to make sure we don't introduce so much overhead that it's measurable
-   Eventually plans to go to a lock-free/sharded/partitioned style like Seastar
-   We are not using Seastar's system because when you fulfill a
    promise you don't want to have the promise fulfilled in that
    thread; it should be easy to fulfill it in a different thread.
-   Also adapting an existing codebase to Seastar is much harder than
    writing one from scratch to use it.
-   It should also allow us to run all the OSDs in the same process
-   We might want to have one messenger per logical OSD and have those share
    threads (loses some efficiency gains but is backwards compatible.)
-   These sorts of changes will also make EC overwrites much easier.
-   Any refactors in the code should move us in this direction as a side effect
-   The sooner the better, so if it does cause performance problems we
    can find out soon
-   Branch is wip-do-op in athanatos
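
For readers who haven't met the term, here is what continuation passing with a trampoline means in miniature. This is a generic illustration and not code from the `wip-do-op` branch; `Trampoline`, `Cont`, and `sum_to` are made-up names.

```c++
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// The trampoline: continuations are queued and run from a flat loop,
// so the stack never grows no matter how long the chain is.
class Trampoline {
  std::deque<std::function<void()>> ready_;

 public:
  void post(std::function<void()> k) { ready_.push_back(std::move(k)); }

  void run() {
    while (!ready_.empty()) {
      std::function<void()> k = std::move(ready_.front());
      ready_.pop_front();
      k();
    }
  }
};

using Cont = std::function<void(std::int64_t)>;

// Continuation-passing sum of 1..n: instead of returning a value, each
// step posts the next step and finally hands the result to `k`.
void sum_to(Trampoline& t, std::int64_t n, std::int64_t acc, Cont k) {
  assert(n >= 0);
  if (n == 0) {
    t.post([k, acc] { k(acc); });
    return;
  }
  t.post([&t, n, acc, k] { sum_to(t, n - 1, acc + n, k); });
}
```

The caller's code looks the same whether a step completes immediately or much later, which is the property the first few items in the list are after. The per-step allocation in this toy is exactly the kind of overhead that would have to be measured.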

### Brief Exploration of the code ###

Adam Emerson looked briefly through the `wip-do-op` branch in
`https://github.com/athanatos/ceph.git` to see what the general design
looked like and how it matched up with our goals.

-   Getting rid of the 'ondisk lock' looks good, someone good at
    scheduling (Matt?) should review the queue. It should not use
    `std::list`, though.
-   The `do_replica_safe_reads` refactor isn't bad but doesn't seem to
    have an immediate effect. Sam described it as providing safety by
    shunting things replicas could do into their own function, so it
    should make future development and refactoring easier.
-   It reinforces the idea that reads inhabit a separate magisterium
    with its own law and dispensation from writes and is the opposite
    direction from the read/write transactions we want. At least
    potentially, we could use it as a fast/safe path and have it do a
    more specialized transaction dispatch for reads, maybe.
-   The `do_op`/`do_replica_op` split seems reasonable for the
    replicated case, since in that one we want to transform the
    transaction before sending it to the replicas. If we want to allow
    CLS methods on EC pools (which we do, in principle) or mixed
    read-write, then the distinction between primary and replica might
    break down.
-   Not sure if the error channel is better per se, but since we
    currently have a bunch of functions that return `int` to indicate
    errors, it might be easier to integrate.
-   C++ should have a `void` type a bit more like unit so you could
    explicitly return `void()` from void functions. You'd think they
    could put *that* in C++17 since their list of things to add to the
    standard now consists entirely of "3 to the version number".
-   The `future` implementation looks promising. I'll need to review
    how it's put together in more detail later; how it's used is more
    pertinent at the moment.
-   Things make sense from a gradualist position. Given the desire for
    a progression from here to _A Really Fast OSD_ where we have
    _A Working OSD_ at every point along the way, this approach makes
    sense. Restructuring everything around a blocking-agnostic futures
    design then opens the way to introducing asynchronous, lock-free code.
-   This is also compatible with flexing, since we can have multiple
    `LogicalOSD` implementations with different locking strategies or
    core affinity.
-   `aio_read` looks to be less aio than the name would suggest. This
    isn't bad, it's reasonable to do a transform by having things call
    blocking procedures in a way that will work if they become non-blocking.
-   Reimplementing the blocking calls in terms of nonblocking calls is
    smart.
-   `OSDReactor` looks like it could be adapted, at least the public
    interface, into LogicalOSD once we made it less PG specific.
-   In principle it's a good idea. A LogicalOSD would have to be bound
    closely to the DataSetInterface it worked with since they're two
    halves of a queueing mechanism.
-   The futures stuff definitely isn't naïve. We need to understand
    the blockers and other details.  The idea of having a future yield
    when it needs to wait for something is a good one.
-   It uses `std::list` though.
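
The "blocking calls in terms of nonblocking calls" point can be shown with a toy store whose only primitive is an asynchronous read. `MiniStore`, `read_async`, `poll`, and `read` are hypothetical names and not the branch's interface, and the "reactor" is just a queue of pending completions driven by hand.

```c++
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

class MiniStore {
  std::string data_;
  std::deque<std::function<void()>> completions_;  // pending async work

 public:
  using ReadCb = std::function<void(int, std::string)>;

  explicit MiniStore(std::string d) : data_(std::move(d)) {}

  // The primitive: queue the read, deliver the result through `cb` later.
  void read_async(std::size_t off, std::size_t len, ReadCb cb) {
    completions_.push_back([this, off, len, cb] {
      if (off > data_.size())
        cb(-1, std::string());
      else
        cb(0, data_.substr(off, len));
    });
  }

  // Run one pending completion; returns false if nothing was pending.
  bool poll() {
    if (completions_.empty()) return false;
    std::function<void()> c = std::move(completions_.front());
    completions_.pop_front();
    c();
    return true;
  }

  // The blocking call is only the async call plus "drive until done".
  int read(std::size_t off, std::size_t len, std::string& out) {
    bool done = false;
    int rc = 0;
    read_async(off, len, [&](int r, std::string s) {
      rc = r;
      out = std::move(s);
      done = true;
    });
    while (!done) {
      const bool progressed = poll();
      assert(progressed);
      (void)progressed;
    }
    return rc;
  }
};
```

Existing callers keep using `read` unchanged, while new code can call `read_async` directly, so the conversion can proceed one call site at a time.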

## Why librados is not wonderful ##

Not that we hate RADOS, we just like Objecter way more.

-   Does not support read and write in same op. Neither does RADOS, to
    be fair, but we plan to fix that.
-   Takes a giant lock with every operation. Yuck.
-   Has its own 'callback' interface
-   Its handling of asynchronous operations seems very heavyweight and
    not natural.
-   Hides the internal structure of RADOS operations
-   Does not expose object locator in a useful way
-   Does way too many allocations
-   The dimensioning of the interface is weird, like binding the IoCtx
    to a pool
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Ann Arbor Team's Flexible I/O Proposals (Ceph Next)
  2016-04-15 21:05 Ann Arbor Team's Flexible I/O Proposals (Ceph Next) Adam C. Emerson
@ 2016-04-15 21:25 ` Milosz Tanski
  2016-04-15 22:12   ` Adam C. Emerson
  2016-04-15 22:09 ` Gregory Farnum
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Milosz Tanski @ 2016-04-15 21:25 UTC (permalink / raw)
  To: The Sacred Order of the Squid Cybernetic

On Fri, Apr 15, 2016 at 5:05 PM, Adam C. Emerson <aemerson@redhat.com> wrote:
> Ceph Developers,
>
> We've put together a few of the main ideas from our previous work in a
> brief form that we hope people will be able to digest, consider, and
> debate. We'd also like to discuss them with you at Ceph Next this
> Tuesday.
>
> Thank you.
>
>
> ---8<---
>
>
> We have been looking at improvements to Ceph, particularly RADOS,
> while focusing on flexibility (allowing users to do more things)
> and performance. We have come up with a few proposals with these two
> things in mind. Sessions and read-write transactions aim to allow
> clients to batch up multiple operations in a way that is safe and
> correct, while allowing clients to gain the advantages of atomic
> read-write operations without having to lock. Sessions also provide
> a foundation for flow-control which ultimately improves performance
> by preventing an OSD from being ground into uselessness under a
> storm of impossible requests. The CLS proposal is a logical follow-on
> from the read-write proposal, as we attempt to address some problems
> of correctness that exist now and consider how to integrate the
> facility into an asynchronous world.
>
> Flexible Placement, as you would expect from the name, is about
> allowing users more control, as are Flexible Semantics. They both
> have profound performance implications, as tuning placement to better
> match a workload can increase throughput, and relaxed consistency can
> decrease latency. The proposed Interfaces are meant to support both as
> well as work currently being done to allow an asynchronous OSD and to
> hide details like locking and thread pools so that backends can be
> written with different forms of concurrency and load-balancing
> across processors.
>
> Finally, Map Partitioning is not directly related to code paths within
> the OSD itself, but does affect everything that can be done with Ceph.
> People are beginning to run into limits on how large a Ceph cluster can
> grow and how many ways they can be partitioned, and both these problems
> fundamentally derive from the way the OSD map is handled by the monitors.
>
> There are also some notes at the end. They are not critical, but if you
> find yourself asking "What were they thinking?" the notes might help.
>
> # Sessions and Read-Write #
>
> From `ReplicatedPG.cc`.
>
> ```c++
> // Write operations aren't allowed to return a data payload because
> // we can't do so reliably. If the client has to resend the request
> // and it has already been applied, we will return 0 with no
> // payload.  Non-deterministic behavior is no good.  However, it is
> // possible to construct an operation that does a read, does a guard
> // check (e.g., CMPXATTR), and then a write.  Then we either succeed
> // with the write, or return a CMPXATTR and the read value.
> …
> if (ctx->op_t->empty() || result < 0) {
>   …
>   if (ctx->pending_async_reads.empty()) {
>     complete_read_ctx(result, ctx);
>   } else {
>     in_progress_async_reads.push_back(make_pair(op, ctx));
>     ctx->start_async_reads(this);
>   }
>   return;
> }
> …
> // issue replica writes
> ceph_tid_t rep_tid = osd->get_tid();
>
> RepGather *repop = new_repop(ctx, obc, rep_tid);
>
> issue_repop(repop, ctx);
> eval_repop(repop);
> ```
>
> As you can see, if we have any writes (all mutations end up in the
> `op_t` transaction), we just flat out don't do the requested read
> operations. If we don't have any writes, we perform the read
> operations and return.  This is justified in the comment above because
> of the non-deterministic behavior of resent read-write operations.
>
> This is not an unsolved problem and we can bootstrap a solution on our
> existing `Session` infrastructure.
>
> ## An upgraded session ##
>
> Behold, `OSDSession`:
> ```c++
> struct Session : public RefCountedObject {
>   EntityName entity_name;
>   OSDCap caps;
>   int64_t auid;
>   ConnectionRef con;
>   WatchConState wstate;
>   …
> };
> ```
>
> This structure exists once for every connection to the OSD. Where they
> are created depends on who is doing the creation. In the case of
> clients (what we're interested in) it occurs in `ms_handle_authorizeri`
> ```c++
> …
> isvalid = authorize_handler->verify_authorizer(cct, monc->rotating_secrets,
>                                                authorizer_data, authorizer_reply, name, global_id, caps_info, session_key, &auid);
>
> if (isvalid) {
>   Session *s = static_cast<Session *>(con->get_priv());
>   if (!s) {
>     s = new Session(cct);
>     con->set_priv(s->get());
>     s->con = con;
>     dout(10) << " new session " << s << " con=" << s->con << " addr=" << s->con->get_peer_addr() << dendl;
>   }
>
>   s->entity_name = name;
>   if (caps_info.allow_all)
>     s->caps.set_allow_all();
>   s->auid = auid;
>   …
> }
> ```
>
> In order to solve this problem, we propose a new data structure,
> modelled on NFSv4.1
> ```c++
> struct OpSlot {
>   uint64_t seq;
>   int r;
>   MOSDOpReplyRef cached; // Nullable
>   bool completed;
> };
> ```
>
> We do not want to give the OSD an unbounded obligation to hang on to
> old message replies: that way lies madness. So, the additions to
> `Session` we might make are:
>
> ```c++
> struct Session : public RefCountedObject {
>   …
>   uint32_t maxslots; // The maximum number of operations this client
>                      // may have in flight at once;
>   std::vector<OpSlot> slots; // The vector of in-progress operations
>   ceph::timespan slots_expire; // How long we wait to hear from a
>                                // client before the OSD is free to
>                                // drop session resources
>   ceph::coarse_mono_time last_contact; // When (by our measure) we
>                                        // last received an operation
>                                        // from the client.
> };
> ```
>
> ## Message Additions ##
>
> The OSD needs to communicate this information to the client. The most
> useful way to do this is with an addition to `MOSDOpReply`.
>
> ```c++
> class MOSDOpReply : public Message {
>   …
>   uint32_t this_slot;
>   uint64_t this_seq;
>   uint32_t max_slot;
>   ceph::timespan timeout;
>   …
> };
> ```
>
> This overlaps with the function of the transaction ID, since the
> slot/sequence/OSD triple uniquely identifies an operation. Unlike the
> transaction ID, this provides consistent semantics and a measure of
> flow control.
>
> To match our reply, the `MOSDOp` would need to be amended.
> ```c++
> class MOSDOp : public Message {
>   …
>   uint32_t this_slot;
>   uint64_t this_seq;
>   bool please_cache;
>   …
> };
> ```
>
> ## Operations ##
>
> ### Connecting ###
>
> A client, upon connecting to an OSD for the first time should send a
> `this_slot` of 0 and a `this_seq` of 0. If it reconnects to an OSD it
> should use the `this_slot` and `this_seq` values from before it lost
> its connection. If an OSD has state for a client and receives a
> `(slot,seq) = (0,0)` then it should feel free to free any saved state
> and start anew.
>
> ### OSD Feedback ###
>
> In every `MOSDOpReply` the OSD should send `this_slot` and `this_seq` to
> the value from the `MOSDOp` to which we're replying.
>
> More usefully, the OSD can inform the client how many operations it is
> allowed to send concurrently with `max_slot`. The client must **not**
> send a slot value higher than `max_slot`. (The OSD should error if it
> does.)
>
> The OSD may increase the number of operations allowed in-flight
> if it has capacity by increasing `max_slot`. If it finds itself
> lacking capacity, it may decrease `max_slot`. If it does, the client
> should respect the new bound. (The OSD should feel free to free the
> rescinded slots as soon as the client sends another `MOSDOp` with a
> slot value equal to one on which the new `max_slot` has been sent.)
>
> If the client sends a `this_seq` lower than the one held for a slot by
> the OSD, the OSD should error. If it is more than one greater than the
> current `this_seq`, the OSD should error.
>
> ### Caching ###
>
> The client is in an excellent position to know whether it **requires**
> the output of a previous operation of mixed reads and writes on
> resend, or whether it merely needs the status on resend. Thus, we let
> the client set `please_cache` to request that the OSD store a
> reference to the sent message in the appropriate `OpSlot`.
>
> The OSD is in an excellent position to know how loaded it is. It can
> calculate a bound on how large a given reply will be before executing
> it. Thus, the OSD can send an error if the client has requested it
> cache something larger than it feels comfortable caching.
>
> Assuming no errors, the behavior, for any slot, is this: If the client
> sends an `MOSDOp` with a `this_seq` one greater than the current value
> of `OpSlot::seq`, that represents a new operation. Increment
> `OpSlot::seq`, clear `OpSlot::completed` and begin the operation. When
> the operation finishes, set `OpSlot::completed`. If `please_cache` has been
> set, store the `MOSDOpReply` in `OpSlot::cached`. Otherwise simply store the
> result code in `OpSlot::r`.
>
> If the client sends an `MOSDOp` with a `this_seq` equal to
> `OpSlot::seq` and `OpSlot::completed` is false, drop the request. (We
> will reply when it completes.) If it has completed, send the stored
> `OpSlot::cached` reply if there is one, otherwise send a reply
> with just `OpSlot::r`.
>
> ### Reconnection ###
>
> Currently the `Session` is destroyed on reset and a new one is created
> on authorization. In our proposed system the `Session` will not be
> destroyed on reset, it will be moved to a structure where it can be
> looked up and destroyed after `timeout` since the last message
> received.
>
> On connection, the OSD should first look up a `Session` keyed
> on the entity name and create one if that fails.
>
> # Read as a part of Transaction #
>
> We don't have code examples here since most of the interface
> changes are obvious. Codes and parameters would be added to
> `PGBackend::Transaction` and executing a transaction would have to
> return data.
>
> ## Motivation ##
>
> -   Mixed reads and writes are an efficiency win, since a client can
>     save round trips by batching up operations in a single request.
>     Current Ceph does not allow them for reasons which are quoted and
>     addressed in the preceding section.
> -   Mixed reads and writes are a semantic win. If an `MOSDOp` is
>     atomic (it is in current Ceph), read-after-write can often remove
>     the need for explicit locking.
> -   Transactional reads may seem complicated, but the Erasure Coding
>     backend already has to execute complex read transactions to
>     reassemble or recover data. We want an asynchronous read capability
>     in the Store anyway and there's no reason not to have it be shared
>     with our asynchronous write path.
> -   While it might seem that separating reads and writes, as we do
>     now, allows us to simplify code and rule out edge cases, we would
>     like to point out the existence of CLS, which can have problems if
>     two method calls occur in the same `MOSDOp`.
>
> ## Sketch ##
>
> The main problem with mixed read-write transactions is that replicas
> need to write but not read. The key to handling this is dependency
> checking. Outside CLS (which will be discussed below) it is very easy
> to see whether reads and writes are independent. (Simply go down the
> ops and see if their ranges overlap and whether getattrs and setattrs
> have keys in common.) Reads coming after overlapping writes depend on
> the previous writes. Then:
> -   If an op is all reads, simply do all the reads. We don't have
>     to get write locks or anything.
> -   If an op is all writes, it's no different than a replicated
>     operation now.
> -   For mixed reads and writes, if the reads aren't dependent on the
>     writes, dispatch the writes and do the reads before, after, or
>     concurrently with the writes on the primary.  (So long as we
>     prevent writes from other transactions from intervening.)
> -   Dependent reads are the difficult case. For erasure coding it
>     shouldn't make any difference since we'd have to dispatch reads and
>     writes to all stripes anyway. For replication, we would want to
>     execute the mixed read-write transaction on the local store in
>     strict order and dispatch one consisting of only writes to the
>     remotes.
>
> # CLS #
>
> ## Current Problem ##
>
> The CLS API works by making an ops vector and handing it to
> `do_osd_ops`.
>
> ```c++
> int cls_cxx_getxattr(cls_method_context_t hctx, const char *name,
>                      bufferlist *outbl)
> {
>   ReplicatedPG::OpContext **pctx = (ReplicatedPG::OpContext **)hctx;
>   bufferlist name_data;
>   vector<OSDOp> nops(1);
>   OSDOp& op = nops[0];
>   int r;
>
>   op.op.op = CEPH_OSD_OP_GETXATTR;
>   op.indata.append(name);
>   op.op.xattr.name_len = strlen(name);
>   r = (*pctx)->pg->do_osd_ops(*pctx, nops);
>   if (r < 0)
>     return r;
>
>   outbl->claim(op.outdata);
>   return outbl->length();
> }
>
> int cls_cxx_setxattr(cls_method_context_t hctx, const char *name,
>                      bufferlist *inbl)
> {
>   ReplicatedPG::OpContext **pctx = (ReplicatedPG::OpContext **)hctx;
>   bufferlist name_data;
>   vector<OSDOp> nops(1);
>   OSDOp& op = nops[0];
>   int r;
>
>   op.op.op = CEPH_OSD_OP_SETXATTR;
>   op.indata.append(name);
>   op.indata.append(*inbl);
>   op.op.xattr.name_len = strlen(name);
>   op.op.xattr.value_len = inbl->length();
>   r = (*pctx)->pg->do_osd_ops(*pctx, nops);
>
>   return r;
> }
> ```
>
> The `do_osd_ops` function performs reads inline, synchronously, right
> then and there for replicated pools. (Erasure coded pools are more
> limited.) Writes are batched up and added to the transaction
> associated with the current `OpContext`.
>
> This is bad. If one has a CLS method that performs a read-modify-write
> and one calls it twice in the same `MOSDOp`, it becomes a
> read-modify-read-modify-write-write which may produce incorrect
> results.
>
> ## Desiderata ##
>
> -   CLS operations should be composable. We should be able to have many
>     of them in a single operation.
> -   They should remain transactional. If a CLS operation does some
>     reads and hits an error, it stops and nothing is written to the
>     store. We should not allow situations where a CLS method can write
>     a partial result to the store then error.
> -   They should be capable. We should not put too many restrictions on
>     what an operation is allowed to do. It should be possible to run
>     them on Erasure Coded Pools once ECOverwrite is in place. (At
>     least some subset of them).
> -   They should be consistent. A CLS operation should be able to call
>     rand or generate a UUID without each replica holding a different
>     value. (This rules out solutions like calling the method on each
>     replica.)
> -   They should be efficient and optimizable.
> -   They should work in an asynchronous framework.
>
> There are several ways we could change their implementation to address
> these.
>
> ## Futures ##
>
> This is an attractive way to think about CLS. It allows things to
> proceed asynchronously and would solve the RMRMWW problem. One would
> simply make every I/O operation in the CLS API a call returning a
> future and write each method in continuation passing style. Executing
> the transaction in the primary OSD (on a replicated pool) would create
> a write-only transaction that could then be sent to replicas. (Having
> the execution of a CLS method also compile a write-only transaction is
> a property of any composable design.)
> -   Tracking dependencies before the operation is executed would be
>     problematic. There would be no way to know whether later reads
>     overlapped with previous writes before doing them. This could lead
>     to an unbounded obligation on the OSD to maintain state to
>     evaluate OSDOp, including potentially large writes, before
>     actually committing in order for CLS methods to remain
>     transactional.
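
The continuation-passing idea above could look roughly like the following. This is only a sketch: `CpsContext`, `WriteTxn`, `copy_attr`, and `apply` are hypothetical names, and the continuation is invoked immediately instead of through a real future, so it shows the shape of the API rather than any asynchrony.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Store = std::map<std::string, std::string>;

// Write-only transaction compiled while a method runs on the primary.
struct WriteTxn {
  std::vector<std::pair<std::string, std::string>> writes;
};

// Hypothetical CLS I/O context. read() hands its result to a continuation
// instead of returning it, so a real implementation could defer the call.
class CpsContext {
 public:
  explicit CpsContext(const Store& store) : store_(store) {}

  void read(const std::string& key,
            const std::function<void(const std::string&)>& k) const {
    assert(static_cast<bool>(k));  // a continuation is mandatory
    const auto it = store_.find(key);
    k(it == store_.end() ? std::string() : it->second);
  }

  // Writes are never applied here, only recorded.
  void write(std::string key, std::string value) {
    txn_.writes.emplace_back(std::move(key), std::move(value));
  }

  const WriteTxn& compiled() const { return txn_; }

 private:
  const Store& store_;
  WriteTxn txn_;
};

// Example method in continuation passing style: copy "src" to "dst".
inline void copy_attr(CpsContext& ctx) {
  ctx.read("src", [&ctx](const std::string& value) {
    ctx.write("dst", value);
  });
}

// What a replica would do: apply the compiled writes, never the method.
inline void apply(Store& store, const WriteTxn& txn) {
  for (const auto& w : txn.writes) store[w.first] = w.second;
}
```

Running `copy_attr` on the primary leaves the primary's store untouched and yields a one-entry `WriteTxn` that `apply` can replay on any replica.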
>
> Futures are, on their own, insufficient to provide everything we need
> from CLS, largely because they are opaque to the OSD. They could be
> combined with…
>
> ## Pre-declaration ##
>
> We could remove some of the generality. A simple way of doing so would
> be to have methods declare, as a part of their signature, everything
> that they may ever read or write, with the expectation that methods
> will name the fewest resources required. This doesn't mean that every
> method will always write to and read from everything it mentions,
> merely that we have a known bound of the maximum it will ever use.
>
> This makes analysis easier for the OSD, and in the composition case,
> it could go in two passes. In the first, it would execute CLS calls
> and pre-stage results and in the second it would pass its compiled
> write transaction into the store.
>
> This is the most attractive solution, but the resources used in
> pre-staging depend on pre-declaration being done well.
>
> One could make things easier by being even more restrictive and
> imposing ordering:
> 1.  The method declares in advance all read operations it might ever
>     perform.
> 2.  The method declares in advance all write operations it might ever
>     perform.
> 3.  The method examines the parameters passed by the client and
>     indicates which subset of the named inputs and outputs it will
>     use.
> 4.  The method performs its read operations and denotes exactly which
>     output operations it will perform. (not the data to be output, but
>     ranges and names.)
> 5.  The method performs write operations.
>
> The most restrictive form of this would operate in two phases. In the
> first, the CLS method would be presented with its parameters and would
> name all of the things it plans to read or write (objects with ranges
> and attribute keys). In the second, it would be called with the
> contents of all the reads it requested and would supply the data for
> all the writes it requested.
>
> This would obviate the need for futures or other asynchronous I/O,
> and make evaluation very easy. This approach would disallow some
> operations, like indirecting through an attribute key to read another,
> but is very appealing.
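
One way to picture the two-phase form is the sketch below. Everything in it is an assumption made for illustration (`Plan`, `TwoPhaseMethod`, `AppendLog`, `run_two_phase`, and the string-keyed toy store); ranges and object names are left out to keep it short.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using Store = std::map<std::string, std::string>;
using KeySet = std::set<std::string>;
using Values = std::map<std::string, std::string>;

// Phase one output: everything the method will read and write.
struct Plan {
  KeySet reads;
  KeySet writes;
};

// Hypothetical two-phase method interface.
struct TwoPhaseMethod {
  virtual ~TwoPhaseMethod() = default;
  // Phase one: look only at the client parameters.
  virtual Plan declare(const std::string& params) const = 0;
  // Phase two: receive the declared reads, return data for the writes.
  virtual Values execute(const std::string& params,
                         const Values& read_results) const = 0;
};

// Example method: append the parameter string to attribute "log".
struct AppendLog : TwoPhaseMethod {
  Plan declare(const std::string&) const override {
    return Plan{KeySet{"log"}, KeySet{"log"}};
  }
  Values execute(const std::string& params,
                 const Values& read_results) const override {
    assert(read_results.count("log") == 1);  // driver supplies declared reads
    return Values{{"log", read_results.at("log") + params}};
  }
};

// Driver standing in for the OSD. Nothing is written unless every
// output was declared in phase one.
inline bool run_two_phase(Store& store, const TwoPhaseMethod& method,
                          const std::string& params) {
  const Plan plan = method.declare(params);
  Values inputs;
  for (const auto& key : plan.reads) {
    const auto it = store.find(key);
    inputs[key] = (it == store.end()) ? std::string() : it->second;
  }
  const Values outputs = method.execute(params, inputs);
  for (const auto& kv : outputs)
    if (plan.writes.count(kv.first) == 0) return false;
  for (const auto& kv : outputs) store[kv.first] = kv.second;
  return true;
}
```

Because the method never touches the store itself, the driver is free to perform the declared reads however it likes (synchronously or not) before phase two runs.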
>
> ## Be Transactional ##
>
> Our transactions are pre-checked and must succeed. If we want the most
> expressive version of CLS consistent with our other goals, then we
> should add commit and rollback. EC Overwrite will already require some
> form of commit and rollback, so it's not beyond the realm of thought.
>
> It could also be a foundation for some future multi-object-transaction
> supporting backend.
>
> This idea might have appeal on its own, but the concerns of CLS are
> not sufficient to motivate it.
>
> ## Domain Specific Language ##
>
> One could make a domain specific language, based on something simple,
> that the OSD can execute to perform CLS methods. The OSD could then
> analyze each method to see what I/O operations a method calls and try
> to track them.
> -   Dependency tracking for compilers is a major area of research. It
>     would be a whole lot of fun, but as a short term solution it is
>     not really practical.
> -   We still wouldn't be able to rule out problems in the general
>     case.
>
> This approach would be interesting as a long term academic research
> project, but is not suitable for a short-range improvement.
>
> # Flexible Placement #
>
> This is a large topic which should be discussed on its own, but it
> motivates the interface designs below, so we shall briefly mention why
> it's interesting.
>
> CRUSH/PG is a fine placement system for several workloads, but it has
> two well-known limitations.
>
> ## Motivation ##
>
> -   Data distribution can be much less uniform than one might like,
>     giving uneven use of disks. This has caused some Ceph developers
>     to experiment with Monte Carlo based placement algorithms.
> -   Data distribution can be much more uniform than one would
>     like. This is the fundamental cause of Ceph's slow sequential read
>     performance. More generally, unrelated workloads contend
>     with each other due to a lack of affinity for related data. The effects are
>     especially pronounced on spinning disk (due to seek times), but
>     still exist on Flash (due to bus/network contention.)  This is a
>     tension between competing goods. CRUSH gains wide dispersion and
>     uniformity to defend against correlated failures but this imposes
>     a tradeoff.
>
> ## Goal ##
>
> Ceph should support placement methods other than CRUSH/PG. Currently,
> the OSD dispatches operations based on placement group ID, which will
> need to be varied.
>
> We also need some way to get new types of functions into the cluster.
>
> ## Proposal ##
>
> Our proposal is, in a way, CRUSH taken to its logical
> conclusion. Instead of distributing CRUSH rules, we propose to
> distribute general computable functions from (oid, volume/dataset) pairs to
> sequences of OSDs with their supporting data structures.  One of our
> ongoing research projects has been an in-process executor for these
> functions based on Google's NaCl. The benefits are:
> -   Administrators can fine-tune placement functions to fit their
>     workloads well.
> -   They can also experiment easily without having to recompile all of
>     Ceph and make heavy architectural changes.
> -   Entirely new placement strategies can be deployed without having
>     to upgrade every machine in the cluster. Or any machine in the
>     cluster, once they've been upgraded to a Flexible Placement
>     capable version.
> -   Possibilities for annealing and machine learning to gradually
>     adapt placement in response to load data become available.
> -   NaCl builds on LLVM which has a rich set of tools for optimizations
>     like partial evaluation.
> -   NaCl is fast.
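
As a rough picture of "a computable function from (oid, volume) pairs to sequences of OSDs", here is a sketch in plain C++. It says nothing about NaCl or how such functions would be distributed; `PlacementFn`, `OsdId`, and `make_hashed_placement` are hypothetical names, and the hash-then-successors rule is just a placeholder for whatever an administrator might write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using OsdId = std::int32_t;

// A placement function maps an (oid, volume) pair to an ordered
// sequence of OSDs.
using PlacementFn = std::function<std::vector<OsdId>(
    const std::string& oid, const std::string& volume)>;

// Placeholder policy: hash the pair to choose a starting OSD, then take
// the following OSDs in order so that all chosen OSDs are distinct.
inline PlacementFn make_hashed_placement(std::vector<OsdId> osds,
                                         std::size_t replicas) {
  assert(!osds.empty() && replicas <= osds.size());
  return [osds = std::move(osds), replicas](const std::string& oid,
                                            const std::string& volume) {
    const std::size_t start =
        std::hash<std::string>{}(volume + '/' + oid) % osds.size();
    std::vector<OsdId> out;
    for (std::size_t i = 0; i < replicas; ++i)
      out.push_back(osds[(start + i) % osds.size()]);
    return out;
  };
}
```

A caller would build one function per volume, for example `make_hashed_placement({10, 11, 12, 13}, 3)`, and invoke it with an object name and volume name. Swapping in a different policy changes only the body of the lambda, not the callers.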
>
> # Flexible Semantics #
>
> Another motivating example. Originally, Ceph did replication and only
> replication under a very specific consistency model. There has been
> desire for more flexibility.
> -   Erasure Coding. It still follows the Ceph consistency model
>     (though leaves out many operations) but is very different in
>     back-end dispatch, enough so that it inspired a major rewrite of
>     the OSD's bottom half.
> -   Append-only immutable objects have been discussed.
> -   Many people have asked for relaxed consistency to improve
>     performance. This would not be suitable for all workloads, but people
>     have repeatedly asked for the ability to set up low-latency,
>     relaxed-consistency volumes that still provide Ceph's ability to
>     easily use new storage and scale well.
> -   Transactional storage. As mentioned above, cross-object
>     transactional semantics are a thing people may have desired.
>
> # Interfaces #
>
> Right now our class hierarchy is a bit of a mess. Eventually we'll do
> something about `PG` and `ReplicatedPG`, refactor, support
> asynchronous I/O, reduce lock contention, support core affinity,
> and build Jerusalem here in England's green and pleasant land.
>
> While we're stringing up our bows of burning gold, we should support
> non-PG based placement and flexible semantics. Right now, parts of the
> PG and the OSD (since the OSD manages the collection of PGs, spins
> them up, and manages thread pools shared by sets of PGs) are
> intertwined. Thus, we need to abstract out both pieces.
>
> As we also want to support having multiple "logical" OSDs running in a
> single `ceph-osd` process, this would be a natural time to add that
> capability.
>
> Both these are sketches and should be considered a work in progress.
>
> ## `DataSetInterface` ##
>
> Here is a sketch of what a flexible abstraction based on PG could look
> like, at least parts of one. Not being informed about Scrub,
> Recovery, or Cache Tiering, having only focused on the object
> operation path, we won't include those details here.
>
> We also leave out functions called from the PG itself or other objects
> invoked from down the stack.
>
> ```c++
> class DataSetInterface {
> protected:
>   LogicalOSD& losd; // LogicalOSD is a means to have different
>                     // stores/semantics run in the same process.
>
>   MapPartRef curmap; // Subset of map relevant to this DSI
> public:
>   // The OSD (things Up the Stack, generally) should not call 'lock'
>   // on us. If we have locking of some sort things down the stack that
>   // we have some relationship with (friend or whatever) could lock or
>   // unlock us, but that should not be baked in as part of the interface.
>
>   // Things like the info struct and details about loading the Place
>   // wouldn't actually be here. As there is an intimate relation
>   // between the LogicalOSD and an implementation of DataSetInterface (it
>   // holds all those loaded in memory and controls dispatch), they
>   // would not need to be part of the generic interface.
>
>   const coll_t coll; // The subdivision of the Store we control
>
>   // In the PG case we always know we're the primary or not for
>   // anything within the same pgid. That is not expected to be the
>   // case generally.
>   virtual bool is_primary(const OpRequest&) = 0;
>   // No 'is_replica' since 'replica' may not be applicable
>   // generally. It's a bit off even in the erasure coded case.
>   virtual bool is_acting(const OpRequest&) = 0;
>   virtual bool is_inactive() = 0;
>
>  public:
>   // No identifier. The descendent will take that.
>   DataSetInterface(LogicalOSD& o, OSDMapRef curmap);
>   virtual ~DataSetInterface();
>
>   DataSetInterface(const DataSetInterface&) = delete;
>   DataSetInterface& operator =(const DataSetInterface&) = delete;
>   DataSetInterface(DataSetInterface&&) = delete;
>   DataSetInterface& operator =(DataSetInterface&&) = delete;
>
>   virtual void on_removal(ObjectStore::Transaction *t) = 0;
>
>   // Yes, there's no 'queue' and no 'do_op' or any of
>   // that. This is intentional. There's no dequeue or do_op because
>   // those functions are either called only by the PG currently OR
>   // they're called in OSD functions called by the PG as part of the
>   // thread switch. They should not be part of the public interface.
>
>   // There's no queue because we can either put queue here or we can
>   // put queue in LogicalOSD. (We could do both, but that seems bad to
>   // me.) If there is some combination of locking and checking that
>   // must be done before queueing an operation, it seems that it's
>   // better to do it in LogicalOSD so that it doesn't leak out and
>   // become part of the abstraction for other implementations.
> };
> ```
>
> ## `LogicalOSD` ##
>
> The OSD class itself (representing the single OSD process) should have
> a map (*perhaps* a Boost.Intrusive.Set?) mapping OSD IDs to
> `LogicalOSD` instances.
>
> ```c++
> class LogicalOSD {
>   OSD& osd;
>   ObjectStore& store;
>
>   // Look up the DataSetInterface instance appropriate to the given
>   // OpRequest.
>   virtual future<DataSetInterface,int> get_place_for(const OpRequest&) = 0;
>
>   // Every logical OSD will have its own watchers as well as slot
>   // cache. Someone familiar with flow control should check this
>   // idea. Since LogicalOSDs will, ideally, share messengers we might
>   // want them to share the same slot cache. In that case we should
>   // just re-dimension watchers within Session
>   SessionRef session_for(const entity_name_t& name);
>
>   void queue(DataSetInterfaceRef&& pi, OpRequestRef&& to_queue);
>   void queue_front(DataSetInterfaceRef&& pi, OpRequestRef&& to_queue);
>
>   // Dequeue and the like are currently called in the PG itself and so
>   // have no place in the interface presented to the OSD.
>
>   void pause();
>   void resume();
>   void drain();
> };
> ```
>
> ## Library ##
>
> Both these interfaces are quite thin and intentionally so. Scrubbing
> and recovery have not been addressed at all, as mentioned, so those
> parts will be expanded.  Asynchrony should allow us simpler interfaces
> since some complexity of requeuing will be handled by futures and
> continuations.
>
> We obviously do not want to rewrite all our existing code. Instead
> most of the existing work on `PG` and `ReplicatedPG` should be
> refactored into a templated library from which implementations of
> `LogicalOSD` and `DataSetInterface` can be constructed.
>
> # Map Partitioning #
>
> There are two huge problems with scalability in Ceph.
> 1.  The OSDMap knows too many things
> 2.  A single monitor manages all updates of everything and replicates them to
>     other monitors.
>
> ## Too Big to Not Fail ##
>
> The monitor map and MDS maps are fine. Each holds data needed to
> locate servers and that's it. It would be very hard to put enough data
> in them to cause problems. The OSD map however contains a trove of data that
> must be updated serially in Paxos and propagated to every OSD,
> monitor, MDS, and client in the cluster.
>
> Pools are a notorious example. We can't create as many pools as users
> would like. Pools are heavyweight, and while they depend on other
> items in the OSD map (like erasure code profiles), it would be nice if
> we could divide them between several monitor clusters, each of which would
> hold a subset of pools. We would need to make sure that clients had up
> to date versions of whatever pools they are using along with the
> status of the OSDs they're speaking to, but that's not
> impossible. Likewise, we should split placement rules out of the OSD
> map, especially once we get into larger numbers of potentially larger
> Flexible Placement style functions.
>
> Nodes should then only need to subscribe to the set of pools and
> placement functions they need to access their data. Changes like these
> should allow users to create the number of pools they want without
> causing the cluster difficulty.
>
> ### Consistency ###
>
> Partitioning makes consistency harder. A simple remedy might be to
> stop referring to data by name or integer. An erasure code profile
> should be specified by UUID and version. So should pools and placement
> functions. When sending a request to the OSD, a client should send the
> versions of the pool, the ruleset, and the OSDMap it used and the OSD
> should check that all three are current.
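
The check described above could be as small as the sketch below. The structure and names (`VersionedId`, `RequestVersions`, `Check`, `check_versions`) are assumptions for illustration, as is representing the OSDMap version as a single integer epoch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Identify an entity by UUID and version rather than by name or integer.
struct VersionedId {
  std::string uuid;
  std::uint64_t version;
};

inline bool operator==(const VersionedId& a, const VersionedId& b) {
  return a.uuid == b.uuid && a.version == b.version;
}

// The three versions a client would attach to a request.
struct RequestVersions {
  VersionedId pool;
  VersionedId ruleset;
  std::uint64_t osdmap_epoch;
};

enum class Check { Current, PoolMismatch, RulesetMismatch, OsdMapMismatch };

// Compare what the client used against what the OSD currently holds and
// report the first mismatch, if any.
inline Check check_versions(const RequestVersions& client,
                            const RequestVersions& osd) {
  assert(!client.pool.uuid.empty());  // an unset UUID means a malformed request
  if (!(client.pool == osd.pool)) return Check::PoolMismatch;
  if (!(client.ruleset == osd.ruleset)) return Check::RulesetMismatch;
  if (client.osdmap_epoch != osd.osdmap_epoch) return Check::OsdMapMismatch;
  return Check::Current;
}
```

Reporting which of the three differs, instead of a single yes or no, would let the client refetch only the stale piece.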
>
> ## The OSD Set ##
>
> The complicating case here is the OSD status set.  Running this
> through a single Paxos limits the number of OSDs that can coexist in a
> cluster.  We ought to split the set of OSDs between multiple masters to
> distribute the load. Each 'Up' or 'Down' event is independent of
> others, so all we require is that events get propagated into the
> correct OSDs and primaries and followers act as they're supposed to.
>
> Versioning is a bigger problem here. We might have all masters
> increment their version when one increments its version if that could
> be managed without inefficiency. We might send a compound version with
> `MOSDOp`s, but combining that with the compound version above might be
> unwieldy. (Feedback on this issue would be greatly appreciated.)
>
> ### Subscription ###
>
> For a large number of OSDs, it would be nice if not everyone were
> notified of all state changes.
>
> For a pool whose placement rule spans only a subset of all OSDs,
> clients using that pool should be able to subscribe to a subset of the
> OSD set corresponding to that pool. This should be fairly easy so long
> as the subset is explicit.
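
With an explicit subset, the filtering itself is simple, as this sketch suggests. `OsdEvent` and `events_for` are hypothetical names, and a real monitor would push events to subscribers as they occur rather than filter a vector after the fact.

```cpp
#include <cassert>
#include <set>
#include <vector>

using OsdId = int;

// An 'Up' or 'Down' state change for one OSD.
struct OsdEvent {
  OsdId osd;
  bool up;
};

// Return only the events a subscriber to the given OSD subset should see,
// preserving their original order.
inline std::vector<OsdEvent> events_for(const std::set<OsdId>& subscribed,
                                        const std::vector<OsdEvent>& all) {
  assert(!subscribed.empty());  // the subset must be explicit
  std::vector<OsdEvent> out;
  for (const auto& ev : all)
    if (subscribed.count(ev.osd) != 0) out.push_back(ev);
  return out;
}
```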
>
> In the case of pools not providing an explicit subset, a monitor (or
> perhaps a proxy in front of a set of monitors) could look at common
> patterns of subscription requests and merge those with significant
> overlap together, so as to give clients a subset without being
> destroyed by the irresistible force of combinatorial explosion.
>
> # Notes #
>
> These are notes taken when reviewing the code and thinking out
> ideas. You don't have to read them, but they are provided as a
> supplement in case you wanted to know what we were thinking and why.
>
> ## ShardedOpWQ ##
>
> -   What is the purpose of `sdata_op_ordering_lock`? A shard is not a
>     PG, so why do things need to be ordered within shards as well as
>     within PGs?
> -   `sdata_lock` pairs up with the condition variable
>
> ## OSD Upper Half ##
>
> ### Regular Dispatch ###
>
> -   Does not overlap with `fast_dispatch`. Operations in
>     `ms_can_fast_dispatch` are not handled in `_dispatch` and vice versa.
> -   Lock the entire OSD
> -   If another dispatch is executing, go to sleep and wait for it to
>     finish. What the heck?
> -   Do Waiters
>     * Waiters are a list of `OpRequestRef`s called `finished` for some
>       reason
>     * Whenever we activate an `OSDMap` the requests waiting for the
>       map get put onto 'finished'
> -   Call `_dispatch`
> -   Do some more waiters
> -   Wake up other dispatch threads
> -   Unlock the entire OSD
>
> #### `_dispatch`? ####
>
> A giant case statement that does a bunch of things.
>
> In the case of `OSDOp`, if we have an `OSDMap`, create an `OpRequest` and
> pass it to `dispatch_op`. This is for things like PG commands, not
> actual object operations.
>
> #### `dispatch_op` ####
>
> Another giant case statement.
>
> ### Fast Dispatch ###
>
> #### `ms_fast_preprocess` ####
>
> Update the map epoch if an OSD sends us an OSDMap.
>
> #### `ms_fast_dispatch` ####
>
> -   Make an `OpRequest`
> -   A bit weird and convoluted, it looks like we use the 'op waiting
>     for map' stuff to queue up an op on a reserved map and remove the
>     reservation preventing it from running before we return.
> -   Specifically we mark the op as waiting for its PG in the `Session`
>     and then mark the `Session` as waiting for the new map.
> -   Ultimately things end up in `dispatch_op_fast`
>
> #### `dispatch_op_fast` ####
>
> Shovels operations into type specific calls like…
>
> #### `handle_op` ####
>
> -   Set up map share (if needed)
> -   Calculate the True PGID and Pool (sanity check against the client?)
> -   Either get the pointer to the PG (a base class) or, if it hasn't
>     been loaded in, queue the session to wait for it
> -   If we have the PG, `enqueue_op` (which just calls `PG::queue_op`)
>
> ## OSD Lower Half (Currently PG) ##
>
> `ReplicatedPG` and `PG` are separate for historical reasons and actual
> differentiation occurs in choice of backend according to Word of Sam.
>
> PGs with different consistency properties are explicitly a goal
> now. The idea of a `PGInterface` has been floated to facilitate their
> creation and `ReplicatedPG` would become a child of that.
>
> ### `PG::queue_op` ###
>
> -   Delay if other people are waiting for maps (to preserve the PG Ordering)
> -   Enqueue in `op_wq` (owned by the OSD)
> -   (Why call into the PG then? Just to enforce the map ordering?)
> -   The work queue gathers operations which, during `_process` are later
>     reassembled into a list of work to be done.
> -   `_process` is called by a worker thread in the thread pool, so the
>     call to `dequeue_op` is in a worker thread. Since it's sharded, we
>     get multiple groups of threads each serving some subset of PGs.
>
> ### `OSD::dequeue_op` ###
>
> -   After a bit of fiddling about sharing maps, call `PG::do_op`
>
> ### `PG::do_op` ###
>
> -   Sam says he plans to rewrite this to allow read asynchrony
> -   We want to see reads and writes share the same transaction
>     structure and similar semantics.
> -   We also want to allow reads and writes in the same operation and
>     to use a session mechanism to allow that.
> -   We'll need transaction transforms to, for example, filter out
>     reads before sending an operation to a replicating OSD. This
>     shouldn't be too hard, since the output of read operations can't
>     be the input for write operations. (Except in CLS?)
> -   `do_op` is a virtual function, but the only implementation is in
>     ReplicatedPG.
> -   Here looks to be where we apply ordering to Writes
> -   `execute_ctx` actually performs the operations after `do_op` has set
>     everything up
>
> ### `execute_ctx` ###
>
> -   May be called multiple times on the same `OpContext`
> -   In the case of clone operations (that's the only thing that takes
>     `src_obc`?), get a read-lock for the object context
> -   It's called `ondisk` but I'm not sure why; it doesn't look like they get serialized
> -   Then we have a brief detour into `prepare_transaction`
> -   Here's the read-write restriction. ReplicatedPG.cc:2975. Later we can
>     create a better session abstraction to fix that.
> -   For reads
>     *   `do_osd_op_effects`!
>     *   If all our reads were synchronous (or there were none)
>         `complete_read_ctx`, which creates and sends the reply
>     *   Otherwise, `start_async_reads`, which passes the pending reads off
>         to `objects_read_async`
>     *   Once the backend completes, we go to `finish_read`, which calls
>         `complete_read_ctx`
> -   Trim the PG Log
> -   Hey, cool, there's a lambda! Register an `onack` closure that sends a reply
> -   And `oncommit`. And `onsuccess`. And `finish`.
> -   Package up the `OpContext` and its transaction and whatnot into a
>     `RepOp`. This is where all the mutations get done.
> -   Call `issue_repop`
> -   Call `eval_repop`
> -   Adam really wishes we would use `boost::intrusive_ptr` everywhere
>     and stop using explicit gets and puts.
>
> ### `prepare_transaction` ###
>
> -   `do_osd_ops`!
> -   If we're not full, `finish_ctx`.
>
> ### `do_osd_ops` ###
>
> -   Loop over the ops in a gigantic case statement
> -   If we hit any modification ops set the `user_modify` flag. This is
>     used to update the object version as part of the transaction
> -   On EC pools, do reads asynchronously, pushing them onto a list of
>     reads to complete.
> -   Otherwise do the reads synchronously
> -   CLS calls can be tricky since they read or write depending on the
>     method invoked
> -   It looks like operations performed by CLS are done by calling each
>     operation individually with `do_osd_ops` with reads being done
>     immediately and writes being queued up as part of the transaction
> -   Making the CLS API futures-based interface may be a good thing to do.
> -   Cache ops like flushing seem to be about shovelling triggers to
>     perform actions into the `onack`/`oncomplete` lists.
> -   For write operations, stuff them into the Transaction
> -   In the case of CLS operations which do both reads and writes
>     (which some of them do), it appears that putting two CLS operations
>     in the same OSDOp might lead to weird results since all the reads
>     will happen then all the writes.
>
> ### `finish_ctx` ###
>
> -   Fiddle with object state and logs to update snapshot foo and to make
>     sure the object exists in the form we need it
> -   Update user version if we modified the object
> -   Save the updated `object_info_t`
> -   Append the updated object info to the `PGLog`
> -   Apply context stats
>
> ### `do_osd_op_effects` ###
>
> -   Add watches if we need to add watches
> -   If there's notifies, notify the watchers
> -   Why do we ack notifies?
>
> ### `issue_repop` ###
>
> -   Acquire locks (I'm still not clear why they're called `ondisk`. Is
>     it a lock acquired to use the store and thus it locks the on-disk
>     representation?)
> -   Apply built up attributes (likely versions and things that had been
>     stuck in the PGLog before.)
> -   Submit transaction to the PG Backend. Which is where it gets
>     divided up for Erasure Coding or sent out for replication. I'll
>     count that as Bottom End for the moment alongside the Store.
>     Changes to the backend will be for new consistency models.
>
>     We might be able to get a separation of concerns by varying what
>     is now ReplicatedPG to support different 'gridding' of objects on
>     the OSD and rejigger things so the consistency model is purely a
>     property of the backend. That's appealing from a maintenance
>     perspective, but breaks down if we want things like explicitly
>     marked transactions across multiple objects for some volumes while not
>     paying for them on others. It might not be workable in the general
>     case.
> -   That's also where local application takes place.
>
> ### `eval_repop` ###
>
> -   This function just sends notifications and cleans up when we finish.
> -   Its name is not very appropriate for what it does.
> -   If we're already done, return.
>     *   This isn't bad, but it's specifically necessary because `eval_repop`
>         gets called from several places including the handlers for our
>         subservient OSDs completing an operation.
> -   If everyone's ack'd, fire off our ack handlers. If everyone's
>     completed, fire off our completion handlers.
> -   Notify anyone waiting for the version we've committed…
> -   And for those waiting on the one we've applied
> -   If we've done everything, update usage stats
>     *   Fire off `on_success` callbacks
>     *   Remove ourselves
>
> ## Flex Points ##
>
> ### PlacementGroup/FlexiblePlacement/OtherConsistencyStrategy ###
>
> -   Fast Dispatch currently shoves requests into a PG.
> -   `handle_op` calculates a pgid and actually gets the pointer to or
>     queues the session to wait on the associated PG
> -   If we implement `queue_op` in FlexiblePlacement we can do whatever we
>     want with it. We can ignore the WorkQueue.
> -   Much of the code in `ReplicatedPG` is useful even with other
>     semantic models than PG-ordered replication
> -   We might want to make `ReplicatedPG` a template and
>     supply the `PG` specific parts as a class instantiation. Then we
>     could create more classes for other partition/dispatch models.
> -   We will want a consistency/semantic variation orthogonal to the
>     partition/dispatch model.
>     * In this divide, splitting objects into PGs, where all
>       operations are dispatched into the PG for whatever object they
>       affect, would be partition/gridding
>     * Whereas the total ordering on PG operations and constraints on
>       when a request blocks versus being served are the
>       consistency/semantics
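
The orthogonality described above could be expressed with two independent policy parameters. This is a sketch under stated assumptions, not a refactoring plan: `PgPartition`, `TotalOrder`, `Relaxed`, and `DataSet` are hypothetical names, and each policy is reduced to a single illustrative decision.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Partition/gridding policy: which dispatch target owns an object.
struct PgPartition {
  std::size_t pg_count;
  std::size_t target_for(const std::string& oid) const {
    assert(pg_count > 0);
    return std::hash<std::string>{}(oid) % pg_count;
  }
};

// Consistency/semantics policy: when must a request block?
struct TotalOrder {
  // Under a total order, a request waits for earlier ones in flight.
  bool must_block(std::size_t in_flight) const { return in_flight > 0; }
};

struct Relaxed {
  // A relaxed model in this toy never blocks.
  bool must_block(std::size_t) const { return false; }
};

// The two policies vary independently of each other.
template <class Partition, class Consistency>
class DataSet {
 public:
  DataSet(Partition p, Consistency c) : partition_(p), consistency_(c) {}

  std::size_t target_for(const std::string& oid) const {
    return partition_.target_for(oid);
  }
  bool must_block(std::size_t in_flight) const {
    return consistency_.must_block(in_flight);
  }

 private:
  Partition partition_;
  Consistency consistency_;
};
```

`DataSet<PgPartition, TotalOrder>` and `DataSet<PgPartition, Relaxed>` route an object to the same target but disagree on whether a request must wait, which is the separation the bullets above ask for.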
>
> ### Allocation/Locking/Dispatch ###
>
> -   `OpRequest` (currently allocated in `do_op`) and other structures
>     might be allocated at various points. In our earlier prototype we
>     allocated OpRequest and another structure alongside the MOSDOp and
>     reused MOSDOps rather than deallocating them to cut down on
>     allocator use in the fast path.
>
>     That might fight with also promising designs using core-affine
>     memory management, unless we can determine core affinity quickly
>     before allocating the message. (Maybe peeking into the undecoded
>     bytes?)
> -   Lock freedom should be orthogonal to flexible placement. There may
>     be situations where we want lockful systems in flexible placement
>     (since flexible placement can have a variety of sync behaviors),
>     and we know that Sam and others are interested in pursuing
>     lock-free designs in PG-placement.
> -   In a lock-free design, if PGs are core-affine,
>     `enqueue_op` could just submit a message to a core without locking
>     or some of the thread/worker complexity.
> -   For Volumes, where the volume itself may be partitioned across cores
>     `enqueue_op` would have to look at the object name to find its target.
> -   Thus, we would want to pull that logic into a separate function
>     giving our dispatch target.
>
> ### Read-Write Symmetry ###
>
> -   Thankfully, `init_op_flags` is happy to set both read and write
> -   CLS in particular falls afoul of this. Futures might be the best
>     way to deal with it.
>
> ### Things we know we had to do anyway from previous work ###
>
> -   Use `std::map` less as a parameter/return type, same for std::set
> -   Objecter improvements
>     *   Less allocation, change data structures. A dual to some of the
>         work we want to do to make the EC interface less memory
>         intensive.
>     *   If we have zero copy there should be a way to materialize that
>         at the level of the client.
> -   See about bootstrapping client-side EC from EC overwrite
> -   Librados4 should be more like Objecter than it is like librados3
>
> ## Sam and `do_op` (♪ Doo-Wop? ♪) ##
>
> ### Discussion ###
>
> Notes taken during a BlueJeans call between Adam Emerson and Sam
> Just. (Sorry for any mistakes, recording a conversation while having
> it is tricky.)
>
> -   We should never have to block for I/O
> -   It's not `do_op` per se, though we are rewriting that to put it into a
>     continuation passing style with trampolines
> -   Various bits should be allowed to block, but whether they do or
>     don't should not affect the caller's code-flow.
> -   Once we've got to that point, everything after is easier
> -   We have to make sure we don't introduce so much overhead that it's measurable
> -   Eventually plans to go to a lock-free/sharded/partitioned style like Seastar
> -   We are not using Seastar's system because, when you fulfil a
>     promise you don't want to have the promise fulfilled in that
>     thread, it should be easy to fulfill it in a different thread.
> -   Also adapting an existing codebase to Seastar is much harder than
>     writing one from scratch to use it.
> -   It should also allow us to run all the OSDs in the same process
> -   We might want to have one messenger per logical OSD and have those share
>     threads (loses some efficiency gains but is backwards compatible.)
> -   These sorts of changes will also make EC overwrites much easier.
> -   Any refactors in the code should move us in this direction as a side effect
> -   The sooner the better, so if it does cause performance problems we
>     can find out soon
> -   Branch is wip-do-op in athanatos
>
> ### Brief Exploration of the code ###
>
> Adam Emerson looked briefly through the `wip-do-op` branch in
> `https://github.com/athanatos/ceph.git` to see what the general design
> looked like and how it matched up with our goals.
>
> -   Getting rid of the 'ondisk lock' looks good, someone good at
>     scheduling (Matt?) should review the queue. It should not use
>     `std::list`, though.
> -   The `do_replica_safe_reads` refactor isn't bad but doesn't seem to
>     have an immediate effect. Sam described it as providing safety by
>     shunting things replicas could do into their own function, so it
>     should make future development and refactoring easier.
> -   It reinforces the idea that reads inhabit a separate magisterium
>     with its own law and dispensation from writes and is the opposite
>     direction from the read/write transactions we want. At least
>     potentially, we could use it as a fast/safe path and have it do a
>     more specialized transaction dispatch for reads, maybe.
> -   The `do_op`/`do_replica_op` split seems reasonable for the
>     replicated case, since in that one we want to transform the
>     transaction before sending it to the replicas. If we want to allow
>     CLS methods on EC pools (which we do, in principle) or mixed
>     read-write, then the distinction between primary and replica might
>     break down.
> -   Not sure if the error channel is better per se, but since we
>     currently have a bunch of functions that return `int` to indicate
>     errors, it might be easier to integrate.
> -   C++ should have a `void` type a bit more like unit so you could
>     explicitly return `void()` from void functions. You'd think they
>     could put *that* in C++17 since their list of things to add to the
>     standard now consists entirely of "3 to the version number".
> -   The `future` implementation looks promising. I'll need to review
>     how it's put together in more detail later, how it's used is more
>     pertinent at the moment.
> -   Things make sense from a gradualist position. Given the desire for
>     a progression from here to _A Really Fast OSD_ where we have
>     _A Working OSD_ at every point along the way, this approach makes
>     sense. Restructuring everything around a blocking-agnostic futures
>     design then opens the way to introducing asynchronous, lock-free code.
> -   This is also compatible with flexing, since we can have multiple
>     `LogicalOSD` implementations with different locking strategies or
>     core affinity.
> -   `aio_read` looks to be less aio than the name would suggest. This
>     isn't bad, it's reasonable to do a transform by having things call
>     blocking procedures in a way that will work if they become non-blocking.
> -   Reimplementing the blocking calls in terms of nonblocking calls is
>     smart.
> -   `OSDReactor` looks like it could be adapted, at least the public
>     interface, into LogicalOSD once we made it less PG specific.
> -   In principle it's a good idea. A LogicalOSD would have to be bound
>     closely to the DataSetInterface it worked with since they're two
>     halves of a queueing mechanism.
> -   The futures stuff definitely isn't naïve. We need to understand
>     the blockers and other details.  The idea of having a future yield
>     when it needs to wait for something is a good one.
> -   It uses `std::list` though.
>
> ## Why librados is not wonderful ##
>
> Not that we hate RADOS, we just like Objecter way more
> -   Does not support read and write in same op. Neither does RADOS, to
>     be fair, but we plan to fix that.
> -   Takes a giant lock with every operation. Yuck.
> -   Has its own 'callback' interface
> -   Its handling of asynchronous operations seems very heavyweight and
>     not natural.
> -   Hides the internal structure of RADOS operations
> -   Does not expose object locator in a useful way
> -   Does way too many allocations
> -   The dimensioning of the interface is weird, like binding the IoCtx
>     to a pool

As librados is today, it encapsulates its own networking and event
code. This is pretty frustrating to apps that already have their own
eventloop that want to integrate it.

> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Ann Arbor Team's Flexible I/O Proposals (Ceph Next)
  2016-04-15 21:25 ` Milosz Tanski
@ 2016-04-15 22:12   ` Adam C. Emerson
  2016-04-18 17:19     ` Milosz Tanski
  0 siblings, 1 reply; 15+ messages in thread
From: Adam C. Emerson @ 2016-04-15 22:12 UTC (permalink / raw)
  To: The Sacred Order of the Squid Cybernetic

On 15/04/2016, Milosz Tanski wrote:
[snip]
> As librados is today, it encapsulates its own networking and event
> code. This is pretty frustrating to apps that already have their own
> eventloop that want to integrate it.

When you say integrating with an event loop, would being able to hand off a
function or object to asynchronous calls that would then use whatever mechanism
your event loop had for submitting events be sufficient?

I'm not sure what you mean by networking in this case. Some means for the
library to ask your client to set up or terminate a connection when it needs to
talk to an OSD or monitor?

-- 
Senior Software Engineer           Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Ann Arbor Team's Flexible I/O Proposals (Ceph Next)
  2016-04-15 22:12   ` Adam C. Emerson
@ 2016-04-18 17:19     ` Milosz Tanski
  0 siblings, 0 replies; 15+ messages in thread
From: Milosz Tanski @ 2016-04-18 17:19 UTC (permalink / raw)
  To: The Sacred Order of the Squid Cybernetic

On Fri, Apr 15, 2016 at 6:12 PM, Adam C. Emerson <aemerson@redhat.com> wrote:
> On 15/04/2016, Milosz Tanski wrote:
> [snip]
>> As librados is today, it encapsulates its own networking and event
>> code. This is pretty frustrating to apps that already have their own
>> eventloop that want to integrate it.
>
> When you say integrating with an event loop, would being able to hand off a
> function or object to asynchronous calls that would then use whatever mechanism
> your event loop had for submitting events be sufficient?
>
> I'm not sure what you mean by networking in this case. Some means for the
> library to ask your client to set up or terminate a connection when it needs to
> talk to an OSD or monitor?

Sorry I wasn't explicit enough in the first round (eod friday).

Ideally the library would be reentrant and let you hand off all
networking (e.g. opening and closing connections, and read/write) to your
chosen API. In my case I would hook it up to folly, but I can imagine
people would want to use their library of choice (libev / libevent /
boost::asio / whatever).

This can open possibilities like using the DPDK SDK (via Seastar) or
tighter integration with the qemu eventloop.
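
To make that a bit more concrete, here is a rough sketch of the kind of
hook I have in mind. None of these names exist in librados: `IoHooks`,
`ManualLoop` and `ToyClient` are made up for illustration, and the "read"
just fakes a result. The point is only that the library never owns a
thread or a loop; it asks the application to schedule work.

```c++
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical hooks the application supplies; not an existing librados API.
struct IoHooks {
  // Schedule `fn` to run later on the application's own event loop.
  std::function<void(std::function<void()>)> post;
};

// Stand-in for a folly/libev/asio loop: a queue the application drains itself.
class ManualLoop {
  std::deque<std::function<void()>> q;
public:
  IoHooks hooks() {
    return IoHooks{[this](std::function<void()> fn) { q.push_back(std::move(fn)); }};
  }
  // Run everything queued, including work queued while draining.
  std::size_t run() {
    std::size_t n = 0;
    while (!q.empty()) {
      auto fn = std::move(q.front());
      q.pop_front();
      fn();
      ++n;
    }
    return n;
  }
};

// Hypothetical client: completions are delivered only through the hooks.
class ToyClient {
  IoHooks hooks;
public:
  explicit ToyClient(IoHooks h) : hooks(std::move(h)) {
    assert(static_cast<bool>(hooks.post));  // the application must supply a scheduler
  }
  void aio_read(const std::string& oid, std::function<void(int, std::string)> done) {
    // A real client would start network I/O here; the sketch fakes a result.
    hooks.post([oid, done = std::move(done)]() { done(0, "data:" + oid); });
  }
};
```

Nothing runs until the application calls `ManualLoop::run()`, which is
the property I care about. A real version would also need hooks for
opening/closing connections and for socket readiness, which I've left out.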


-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Ann Arbor Team's Flexible I/O Proposals (Ceph Next)
  2016-04-15 21:05 Ann Arbor Team's Flexible I/O Proposals (Ceph Next) Adam C. Emerson
  2016-04-15 21:25 ` Milosz Tanski
@ 2016-04-15 22:09 ` Gregory Farnum
  2016-04-15 22:29   ` Sessions and Persistence Adam C. Emerson
  2016-04-16  0:03   ` Ann Arbor Team's Flexible I/O Proposals (Ceph Next) Mark Nelson
  2016-04-16  0:07 ` Matt Benjamin
  2016-04-18  4:00 ` Ann Arbor Team's Flexible I/O Proposals (Ceph Next) Haomai Wang
  3 siblings, 2 replies; 15+ messages in thread
From: Gregory Farnum @ 2016-04-15 22:09 UTC (permalink / raw)
  To: The Sacred Order of the Squid Cybernetic

On Fri, Apr 15, 2016 at 2:05 PM, Adam C. Emerson <aemerson@redhat.com> wrote:
> Ceph Developers,
>
> We've put together a few of the main ideas from our previous work in a
> brief form that we hope people will be able to digest, consider, and
> debate. We'd also like to discuss them with you at Ceph Next this
> Tuesday.
>
> Thank you.
>
>
> ---8<---
>
>
> We have been looking at improvements to Ceph, particularly RADOS,
> while focusing on flexibility (allowing users to do more things)
> and performance. We have come up with a few proposals with these two
> things in mind. Sessions and read-write transactions aim to allow
> clients to batch up multiple operations in a way that is safe and
> correct, while allowing clients to gain the advantages of atomic
> read-write operations without having to lock. Sessions also provide
> a foundation for flow-control which ultimately improves performance
> by preventing an OSD from being ground into uselessness under a
> storm of impossible requests. The CLS proposal is a logical follow-on
> from the read-write proposal, as we attempt to address some problems
> of correctness that exist now and consider how to integrate the
> facility into an asynchronous world.
>
> Flexible Placement, as you would expect from the name, is about
> allowing users more control, as are Flexible Semantics. They both
> have profound performance implications, as tuning placement to better
> match a workload can increase throughput, and relaxed consistency can
> decrease latency. The proposed Interfaces are meant to support both as
> well as work currently being done to allow an asynchronous OSD and to
> hide details like locking and thread pools so that backends can be
> written with different forms of concurrency and load-balancing
> across processors.
>
> Finally, Map Partitioning is not directly related to code paths within
> the OSD itself, but does affect everything that can be done with Ceph.
> People are beginning to run into limits on how large a Ceph cluster can
> grow and how many ways they can be partitioned, and both these problems
> fundamentally derive from the way the OSD map is handled by the monitors.
>
> There are also some notes at the end. They are not critical, but if you
> find yourself asking "What were they thinking?" the notes might help.
>
> # Sessions and Read-Write #
>
> From `ReplicatedPG.cc`.
>
> ```c++
> // Write operations aren't allowed to return a data payload because
> // we can't do so reliably. If the client has to resend the request
> // and it has already been applied, we will return 0 with no
> // payload.  Non-deterministic behavior is no good.  However, it is
> // possible to construct an operation that does a read, does a guard
> // check (e.g., CMPXATTR), and then a write.  Then we either succeed
> // with the write, or return a CMPXATTR and the read value.
> …
> if (ctx->op_t->empty() || result < 0) {
>   …
>   if (ctx->pending_async_reads.empty()) {
>     complete_read_ctx(result, ctx);
>   } else {
>     in_progress_async_reads.push_back(make_pair(op, ctx));
>     ctx->start_async_reads(this);
>   }
>   return;
> }
> …
> // issue replica writes
> ceph_tid_t rep_tid = osd->get_tid();
>
> RepGather *repop = new_repop(ctx, obc, rep_tid);
>
> issue_repop(repop, ctx);
> eval_repop(repop);
> ```
>
> As you can see, if we have any writes (all mutations end up in the
> `op_t` transaction), we just flat out don't do the requested read
> operations. If we don't have any writes, we perform the read
> operations and return.  This is justified in the comment above because
> of the non-deterministic behavior of resent read-write operations.
>
> This is not an unsolved problem and we can bootstrap a solution on our
> existing `Session` infrastructure.
>
> ## An upgraded session ##
>
> Behold, `OSDSession`:
> ```c++
> struct Session : public RefCountedObject {
>   EntityName entity_name;
>   OSDCap caps;
>   int64_t auid;
>   ConnectionRef con;
>   WatchConState wstate;
>   …
> };
> ```
>
> This structure exists once for every connection to the OSD. Where they
> are created depends on who is doing the creation. In the case of
> clients (what we're interested in) it occurs in `ms_handle_authorizeri`
> ```c++
> …
> isvalid = authorize_handler->verify_authorizer(cct, monc->rotating_secrets,
>                                                authorizer_data, authorizer_reply, name, global_id, caps_info, session_key, &auid);
>
> if (isvalid) {
>   Session *s = static_cast<Session *>(con->get_priv());
>   if (!s) {
>     s = new Session(cct);
>     con->set_priv(s->get());
>     s->con = con;
>     dout(10) << " new session " << s << " con=" << s->con << " addr=" << s->con->get_peer_addr() << dendl;
>   }
>
>   s->entity_name = name;
>   if (caps_info.allow_all)
>     s->caps.set_allow_all();
>   s->auid = auid;
>   …
> }
> ```
>
> In order to solve this problem, we propose a new data structure,
> modelled on NFSv4.1
> ```c++
> struct OpSlot {
>   uint64_t seq;
>   int r;
>   MOSDOpReplyRef cached; // Nullable
>   bool completed;
> };
> ```
>
> We do not want to give the OSD an unbounded obligation to hang on to
> old message replies: that way lies madness. So, the additions to
> `Session` we might make are:
>
> ```c++
> struct Session : public RefCountedObject {
>   …
>   uint32_t maxslots; // The maximum number of operations this client
>                      // may have in flight at once;
>   std::vector<OpSlot> slots; // The vector of in-progress operations
>   ceph::timespan slots_expire; // How long we wait to hear from a
>                                // client before the OSD is free to
>                                // drop session resources
>   ceph::coarse_mono_time last_contact; // When (by our measure) we
>                                        // last received an operation
>                                        // from the client.
> };
> ```
>
> ## Message Additions ##
>
> The OSD needs to communicate this information to the client. The most
> useful way to do this is with an addition to `MOSDOpReply`.
>
> ```c++
> class MOSDOpReply : public Message {
>   …
>   uint32_t this_slot;
>   uint64_t this_seq;
>   uint32_t max_slot;
>   ceph::timespan timeout;
>   …
> };
> ```
>
> This overlaps with the function of the transaction ID, since the
> slot/sequence/OSD triple uniquely identifies an operation. Unlike the
> transaction ID, this provides consistent semantics and a measure of
> flow control.
>
> To match our reply, the `MOSDOp` would need to be amended.
> ```c++
> class MOSDOp : public Message {
>   …
>   uint32_t this_slot;
>   uint64_t this_seq;
>   bool please_cache;
>   …
> };
> ```
>
> ## Operations ##
>
> ### Connecting ###
>
> A client, upon connecting to an OSD for the first time should send a
> `this_slot` of 0 and a `this_seq` of 0. If it reconnects to an OSD it
> should use the `this_slot` and `this_seq` values from before it lost
> its connection. If an OSD has state for a client and receives a
> `(slot,seq) = (0,0)` then it should feel free to free any saved state
> and start anew.
>
> ### OSD Feedback ###
>
> In every `MOSDOpReply` the OSD should set `this_slot` and `this_seq` to
> the values from the `MOSDOp` to which it is replying.
>
> More usefully, the OSD can inform the client how many operations it is
> allowed to send concurrently with `max_slot`. The client must **not**
> send a slot value higher than `max_slot`. (The OSD should error if it
> does.)
>
> The OSD may increase the number of operations allowed in-flight
> if it has capacity by increasing `max_slot`. If it finds itself
> lacking capacity, it may decrease `max_slot`. If it does, the client
> should respect the new bound. (The OSD should feel free to free the
> rescinded slots as soon as the client sends another `MOSDOp` with a
> slot value equal to one on which the new `max_slot` has been sent.)
>
> If the client sends a `this_seq` lower than the one held for a slot by
> the OSD, the OSD should error. If it is more than one greater than the
> current `this_seq`, the OSD should error.
>
> ### Caching ###
>
> The client is in an excellent position to know whether it **requires**
> the output of a previous operation of mixed reads and writes on
> resend, or whether it merely needs the status on resend. Thus, we let
> the client set `please_cache` to request that the OSD store a
> reference to the sent message in the appropriate `OpSlot`.
>
> The OSD is in an excellent position to know how loaded it is. It can
> calculate a bound on how large a given reply will be before executing
> it. Thus, the OSD can send an error if the client has requested it
> cache something larger than it feels comfortable caching.
>
> Assuming no errors, the behavior, for any slot, is this: If the client
> sends an `MOSDOp` with a `this_seq` one greater than the current value
> of `OpSlot::seq`, that represents a new operation. Increment
> `OpSlot::seq`, clear `OpSlot::completed` and begin the operation. When
> the operation finishes, set `OpSlot::completed`. If `please_cache` has been
> set, store the `MOSDOpReply` in `OpSlot::cached`. Otherwise simply store the
> result code in `OpSlot::r`.
>
> If the client sends an `MOSDOp` with a `this_seq` equal to
> `OpSlot::seq` and `OpSlot::completed` is false, drop the request. (We
> will reply when it completes.) If it has completed, send the stored
> `OpSlot::cached` if there is one, otherwise send a reply
> with just `OpSlot::r`.
>
> ### Reconnection ###
>
> Currently the `Session` is destroyed on reset and a new one is created
> on authorization. In our proposed system the `Session` will not be
> destroyed on reset, it will be moved to a structure where it can be
> looked up and destroyed after `timeout` since the last message
> received.
>
> On connection, the OSD should first look up a `Session` keyed
> on the entity name and create one if that fails.

So the most common time we really get replay operations is when one of
the OSDs crash or a PG's acting set changes for some other reason.
Which means these "cached" operation results need to be persisted to
disk and then cleaned up, a la the pglog.
I don't see anything in these data structures that explains how we do
that efficiently, which is the biggest problem and the reason we don't
already do reply caching. Am I missing something?
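
To be concrete about what I think is being proposed, here is a rough
sketch of the in-memory slot handling from the Caching section above.
`Reply`, `Verdict`, `handle` and `finish` are names I made up, the
`(0,0)` connect handshake and `max_slot` negotiation are ignored, and
nothing here touches disk, which is exactly the part I'm asking about.

```c++
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Reply { int r; std::string payload; };  // stand-in for MOSDOpReply

struct OpSlot {
  uint64_t seq = 0;
  int r = 0;
  std::shared_ptr<Reply> cached;  // nullable
  bool completed = true;
};

enum class Verdict { Start, Drop, Replay, Error };

// Classify an incoming (slot, seq) pair against the session's slot table.
Verdict handle(std::vector<OpSlot>& slots, uint32_t slot, uint64_t seq) {
  if (slot >= slots.size()) return Verdict::Error;  // beyond the allowed slots
  OpSlot& s = slots[slot];
  if (seq == s.seq + 1) {  // one greater: a new operation
    s.seq = seq;
    s.completed = false;
    s.cached.reset();
    return Verdict::Start;
  }
  if (seq == s.seq)  // resend: replay if done, otherwise reply on completion
    return s.completed ? Verdict::Replay : Verdict::Drop;
  return Verdict::Error;  // lower than held, or more than one greater
}

// Record the outcome; keep the full reply only if the client asked for it.
void finish(OpSlot& s, int r, bool please_cache, std::string payload) {
  assert(!s.completed);
  s.r = r;
  s.completed = true;
  if (please_cache)
    s.cached = std::make_shared<Reply>(Reply{r, std::move(payload)});
}
```

All of this state lives in the `Session`, so it is gone the moment the
primary changes.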

And do you think maybe you could split this up into a thread for each
topic? I'm having trouble digesting it as such a wall of text. :)
-Greg

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Sessions and Persistence
  2016-04-15 22:09 ` Gregory Farnum
@ 2016-04-15 22:29   ` Adam C. Emerson
  2016-04-15 22:38     ` Gregory Farnum
  2016-04-18  0:56     ` Sage Weil
  2016-04-16  0:03   ` Ann Arbor Team's Flexible I/O Proposals (Ceph Next) Mark Nelson
  1 sibling, 2 replies; 15+ messages in thread
From: Adam C. Emerson @ 2016-04-15 22:29 UTC (permalink / raw)
  To: The Sacred Order of the Squid Cybernetic

On 15/04/2016, Gregory Farnum wrote:
> So the most common time we really get replay operations is when one of
> the OSDs crash or a PG's acting set changes for some other reason.
> Which means these "cached" operation results need to be persisted to
> disk and then cleaned up, a la the pglog.
> I don't see anything in these data structures that explains how we do
> that efficiently, which is the biggest problem and the reason we don't
> already do reply caching. Am I missing something?

So! I had been considering the usual case of resend to be transient connection
drop between client and OSD. (An example of why feedback is nice :)

I /had/ thought of persisting these things as a possible feature we would want to
add that administrators could turn on or off depending on the level of
reliability they wanted (and if they had some NVRAM on the machine.)

I had not thought specifically about persisting them QUICKLY in the
spinning disk case. One optimization would be refusing to cache read-only
ops so we don't have to pay for a disk-write unless we're using a disk
write. My intuition would suggest a per-OSD op-log that gets written
and committed when the PGLog entry gets committed, but I admit that's
just spur of the moment. It needs a bit more design work, but bundling
it with some of the writes we have to do already seems promising.

> And do you think maybe you could split this up into a thread for each
> topic? I'm having trouble digesting it as such a wall of text. :)

All right. I'll try to start a new thread subject for new concerns as people
bring them up. (Like this one.)

-- 
Senior Software Engineer           Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Sessions and Persistence
  2016-04-15 22:29   ` Sessions and Persistence Adam C. Emerson
@ 2016-04-15 22:38     ` Gregory Farnum
  2016-04-15 22:44       ` Matt Benjamin
  2016-04-16  3:03       ` Adam C. Emerson
  2016-04-18  0:56     ` Sage Weil
  1 sibling, 2 replies; 15+ messages in thread
From: Gregory Farnum @ 2016-04-15 22:38 UTC (permalink / raw)
  To: The Sacred Order of the Squid Cybernetic

On Fri, Apr 15, 2016 at 3:29 PM, Adam C. Emerson <aemerson@redhat.com> wrote:
> On 15/04/2016, Gregory Farnum wrote:
>> So the most common time we really get replay operations is when one of
>> the OSDs crash or a PG's acting set changes for some other reason.
>> Which means these "cached" operation results need to be persisted to
>> disk and then cleaned up, a la the pglog.
>> I don't see anything in these data structures that explains how we do
>> that efficiently, which is the biggest problem and the reason we don't
>> already do reply caching. Am I missing something?
>
> So! I had been considering the usual case of resend to be transient connection
> drop between client and OSD. (An example of why feedback is nice :)

Well, I guess I don't have in-the-field information about the relative
prevalence of these scenarios. But we definitely can't include
features in RADOS that work "as long as you don't have acting set
changes". ;)

>
> I /had/ thought of persisting these things as a possible feature we would want to
> add that administrators could turn on or off depending on the level of
> reliability they wanted (and if they had some NVRAM on the machine.)
>
> I had not thought specifically about persisting them QUICKLY in the
> spinning disk case. One optimization would be refusing to cache read-only
> ops so we don't have to pay for a disk-write unless we're using a disk
> write. My intuition would suggest a per-OSD op-log that gets written
> and committed when the PGLog entry gets committed, but I admit that's
> just spur of the moment. It needs a bit more design work, but bundling
> it with some of the writes we have to do already seems promising.

This is something I've suggested in the past, but I think it's at the
stage where somebody needs to write code demonstrating it is something
approaching performant. If it is, I don't think anybody opposes the
idea; if it's not, then throughput/IOP regressions are not a tradeoff
Sam/Sage are willing to make for this IIRC (and, though I am more
optimistic than I remember them being about our odds of success, I
suppose I'm not either).
-Greg

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Sessions and Persistence
  2016-04-15 22:38     ` Gregory Farnum
@ 2016-04-15 22:44       ` Matt Benjamin
  2016-04-16  3:03       ` Adam C. Emerson
  1 sibling, 0 replies; 15+ messages in thread
From: Matt Benjamin @ 2016-04-15 22:44 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: The Sacred Order of the Squid Cybernetic

Hi,

----- Original Message -----
> From: "Gregory Farnum" <gfarnum@redhat.com>
> To: "The Sacred Order of the Squid Cybernetic" <ceph-devel@vger.kernel.org>
> Sent: Friday, April 15, 2016 6:38:39 PM
> Subject: Re: Sessions and Persistence
> 
...

> This is something I've suggested in the past, but I think it's at the
> stage where somebody needs to write code demonstrating it is something
> approaching performant. If it is, I don't think anybody opposes the
> idea;

That's reasonable.

> if it's not, then throughput/IOP regressions are not a tradeoff
> Sam/Sage are willing to make for this IIRC (and, though I am more
> optimistic than I remember them being about our odds of success, I
> suppose I'm not either).

There seems to be violent agreement about this.

Matt

> -Greg
> 

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Sessions and Persistence
  2016-04-15 22:38     ` Gregory Farnum
  2016-04-15 22:44       ` Matt Benjamin
@ 2016-04-16  3:03       ` Adam C. Emerson
  1 sibling, 0 replies; 15+ messages in thread
From: Adam C. Emerson @ 2016-04-16  3:03 UTC (permalink / raw)
  To: The Sacred Order of the Squid Cybernetic

On 15/04/2016, Gregory Farnum wrote:
[snip]
> Well, I guess I don't have in-the-field information about the relative
> prevalence of these scenarios. But we definitely can't include
> features in RADOS that work "as long as you don't have acting set
> changes". ;)
[snip]
> This is something I've suggested in the past, but I think it's at the
> stage where somebody needs to write code demonstrating it is something
> approaching performant. If it is, I don't think anybody opposes the
> idea; if it's not, then throughput/IOP regressions are not a tradeoff
> Sam/Sage are willing to make for this IIRC (and, though I am more
> optimistic than I remember them being about our odds of success, I
> suppose I'm not either).

All right, I realize the issue now. Since the target can change if the primary
changes, the reply cache has to be not only persistent but replicated along a
PG. This makes plumbing and expiring it more complicated. But, since the reply
can be determined before sending the transaction down into the store, it
should still be fundamentally doable in a way that shouldn't hurt performance. (A
larger volume of data would need to be sent, but it should be sendable in the
same write.)

Obviously, testable code with a lack of performance regressions proclaims
and speculation about performance waits for a half hour then rides public
transportation, so I think we'll have to give this some thought and come
up with a working prototype to demonstrate the point.

So, I'll see if we can whip up a new description and a WIP branch on this
topic before too long.

-- 
Senior Software Engineer           Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Sessions and Persistence
  2016-04-15 22:29   ` Sessions and Persistence Adam C. Emerson
  2016-04-15 22:38     ` Gregory Farnum
@ 2016-04-18  0:56     ` Sage Weil
  1 sibling, 0 replies; 15+ messages in thread
From: Sage Weil @ 2016-04-18  0:56 UTC (permalink / raw)
  To: Adam C. Emerson; +Cc: The Sacred Order of the Squid Cybernetic

On Fri, 15 Apr 2016, Adam C. Emerson wrote:
> On 15/04/2016, Gregory Farnum wrote:
> > So the most common time we really get replay operations is when one of
> > the OSDs crash or a PG's acting set changes for some other reason.
> > Which means these "cached" operation results need to be persisted to
> > disk and then cleaned up, a la the pglog.

Yeah

> > I don't see anything in these data structures that explains how we do
> > that efficiently, which is the biggest problem and the reason we don't
> > already do reply caching. Am I missing something?
> So! I had been considering the usual case of resend to be transient connection
> drop between client and OSD. (An example of why feedback is nice :)
> 
> I /had/ thought of persisting these things as a possible feature we would want to
> add that administrators could turn on or off depending on the level of
> reliability they wanted (and if they had some NVRAM on the machine.)

Yeah, unfortunately they'd have to be persisted all the time, probably 
attached to the pglog entry as Greg mentioned.  Which I think makes this 
pretty much orthogonal to the persistent session discussion.  We can do 
persistent sessions *now* and cache replies so that if a transient error 
forces an OSD to reconnect and the session is still there it won't have to 
resend its writes.  Then it's just an optimization to reduce the impact of 
the failure case (and may or may not be worthwhile, depending on how 
frequent we think that will be).

But to make the read/write ops idempotent, we'll need to persist the reply 
with the update itself.  (Right now the successful reply contains no real 
information, so the existence of a pglog entry or an oi.last_reqid match is 
enough.)  Even if we did do that, the user would need to be careful never 
to have the read very big or else they're turning lots of read data 
into write data.

The big advantage of doing this seems to be that you can pipeline reads 
and writes.  This read+write op is just one example of that, but in the 
end the point is that you persist read results between writes so that 
the client doesn't have to wait.  But I'm skeptical.  It's a huge amount 
of complexity, and expensive... is it really worth it?  Or can the client 
just wait for the write before sending the read, or vice versa?  You 
wouldn't do anything remotely weird like this with a conventional 
storage stack because latencies aren't that large... and it will be harder 
for us to keep latencies down with complexity like this.

> I had not thought specifically about persisting them QUICKLY in the
> spinning disk case. One optimization would be refusing to cache read-only
> ops so we don't have to pay for a disk-write unless we're using a disk
> write.

I think that works in the simple case, but not if you pipeline, say, read 
(nocache), then write, then <disconnect>.  The write will have 
persisted while we reconnect and our read result is gone.  Of course, the 
client may not care.. but if that's the case we don't really need any of 
this.

I think my real question is what are some workloads that really need this.  
FWIW I think I've only seen *one* user of the current read/write 
transaction 'fail with data or do write' so far.  I'm pretty sure RGW has 
lots of cases of 'do some class op' followed by a read to see the result, 
though, and that slows things down.

Perhaps if the interface made the write "result" payload something 
explicit/separate.  For example, a class op could do some transaction 
and populate the write result payload with some new state (which it 
presumably knows).  Then it isn't necessary to build a fully general "do 
arbitrary read operation that orders post-update", which is pretty 
complex, and probably not an efficient way to address the above cls 
mutation op anyway.  This way the write result payload is a known special 
thing and the users will hopefully keep it small to make attaching it to 
pg_log_entry_t (and/or object_info_t) okay...
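
Roughly what I'm picturing, as a sketch only: `LogEntry`,
`MAX_RESULT_BYTES`, `append_update` and `find_replay` are invented names,
the cap is arbitrary, and the real pg_log_entry_t looks nothing like this.

```c++
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t MAX_RESULT_BYTES = 64;  // arbitrary cap for the sketch

struct LogEntry {
  std::string reqid;         // client request id, used to spot resends
  std::string write_result;  // small explicit payload stored with the update
};

// Record the update and its result payload together; refuse oversized payloads.
bool append_update(std::vector<LogEntry>& log, const std::string& reqid,
                   const std::string& result) {
  if (result.size() > MAX_RESULT_BYTES) return false;
  log.push_back(LogEntry{reqid, result});
  return true;
}

// On resend, return the stored entry so the same reply can be sent again.
const LogEntry* find_replay(const std::vector<LogEntry>& log,
                            const std::string& reqid) {
  for (auto it = log.rbegin(); it != log.rend(); ++it)
    if (it->reqid == reqid) return &*it;
  return nullptr;
}
```

Because the payload rides along with the log entry, a resend after a
primary change can get the same answer back without a general read cache.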

sage

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Ann Arbor Team's Flexible I/O Proposals (Ceph Next)
  2016-04-15 22:09 ` Gregory Farnum
  2016-04-15 22:29   ` Sessions and Persistence Adam C. Emerson
@ 2016-04-16  0:03   ` Mark Nelson
  1 sibling, 0 replies; 15+ messages in thread
From: Mark Nelson @ 2016-04-16  0:03 UTC (permalink / raw)
  To: Gregory Farnum, The Sacred Order of the Squid Cybernetic



On 04/15/2016 05:09 PM, Gregory Farnum wrote:
> On Fri, Apr 15, 2016 at 2:05 PM, Adam C. Emerson <aemerson@redhat.com> wrote:
>> Ceph Developers,
>>
>> We've put together a few of the main ideas from our previous work in a
>> brief form that we hope people will be able to digest, consider, and
>> debate. We'd also like to discuss them with you at Ceph Next this
>> Tuesday.
>>
>> Thank you.
>>
>>
>> ---8<---
>>
>>
>> We have been looking at improvements to Ceph, particularly RADOS,
>> while focusing on flexibility (allowing users to do more things)
>> and performance. We have come up with a few proposals with these two
>> things in mind. Sessions and read-write transactions aim to allow
>> clients to batch up multiple operations in a way that is safe and
>> correct, while allowing clients to gain the advantages of atomic
>> read-write operations without having to lock. Sessions also provide
>> a foundation for flow-control which ultimately improves performance
>> by preventing an OSD from being ground into uselessness under a
>> storm of impossible requests. The CLS proposal is a logical follow-on
>> from the read-write proposal, as we attempt to address some problems
>> of correctness that exist now and consider how to integrate the
>> facility into an asynchronous world.
>>
>> Flexible Placement, as you would expect from the name, is about
>> allowing users more control, as are Flexible Semantics. They both
>> have profound performance implications, as tuning placement to better
>> match a workload can increase throughput, and relaxed consistency can
>> decrease latency. The proposed Interfaces are meant to support both as
>> well as work currently being done to allow an asynchronous OSD and to
>> hide details like locking and thread pools so that backends can be
>> written with different forms of concurrency and load-balancing
>> across processors.
>>
>> Finally, Map Partitioning is not directly related to code paths within
>> the OSD itself, but does affect everything that can be done with Ceph.
>> People are beginning to run into limits on how large a Ceph cluster can
>> grow and how many ways they can be partitioned, and both these problems
>> fundamentally derive from the way the OSD map is handled by the monitors.
>>
>> There are also some notes at the end. They are not critical, but if you
>> find yourself asking "What were they thinking?" the notes might help.
>>
>> # Sessions and Read-Write #
>>
>>  From `ReplicatedPG.cc`.
>>
>> ```c++
>> // Write operations aren't allowed to return a data payload because
>> // we can't do so reliably. If the client has to resend the request
>> // and it has already been applied, we will return 0 with no
>> // payload.  Non-deterministic behavior is no good.  However, it is
>> // possible to construct an operation that does a read, does a guard
>> // check (e.g., CMPXATTR), and then a write.  Then we either succeed
>> // with the write, or return a CMPXATTR and the read value.
>> …
>> if (ctx->op_t->empty() || result < 0) {
>>    …
>>    if (ctx->pending_async_reads.empty()) {
>>      complete_read_ctx(result, ctx);
>>    } else {
>>      in_progress_async_reads.push_back(make_pair(op, ctx));
>>      ctx->start_async_reads(this);
>>    }
>>    return;
>> }
>> …
>> // issue replica writes
>> ceph_tid_t rep_tid = osd->get_tid();
>>
>> RepGather *repop = new_repop(ctx, obc, rep_tid);
>>
>> issue_repop(repop, ctx);
>> eval_repop(repop);
>> ```
>>
>> As you can see, if we have any writes (all mutations end up in the
>> `op_t` transaction), we just flat out don't do the requested read
>> operations. If we don't have any writes, we perform the read
>> operations and return.  This is justified in the comment above because
>> of the non-deterministic behavior of resent read-write operations.
>>
>> This is not an unsolved problem and we can bootstrap a solution on our
>> existing `Session` infrastructure.
>>
>> ## An upgraded session ##
>>
>> Behold, `OSDSession`:
>> ```c++
>> struct Session : public RefCountedObject {
>>    EntityName entity_name;
>>    OSDCap caps;
>>    int64_t auid;
>>    ConnectionRef con;
>>    WatchConState wstate;
>>    …
>> };
>> ```
>>
>> This structure exists once for every connection to the OSD. Where they
>> are created depends on who is doing the creation. In the case of
>> clients (what we're interested in) it occurs in `ms_verify_authorizer`:
>> ```c++
>> …
>> isvalid = authorize_handler->verify_authorizer(cct, monc->rotating_secrets,
>>                                                 authorizer_data, authorizer_reply, name, global_id, caps_info, session_key, &auid);
>>
>> if (isvalid) {
>>    Session *s = static_cast<Session *>(con->get_priv());
>>    if (!s) {
>>      s = new Session(cct);
>>      con->set_priv(s->get());
>>      s->con = con;
>>      dout(10) << " new session " << s << " con=" << s->con << " addr=" << s->con->get_peer_addr() << dendl;
>>    }
>>
>>    s->entity_name = name;
>>    if (caps_info.allow_all)
>>      s->caps.set_allow_all();
>>    s->auid = auid;
>>    …
>> }
>> ```
>>
>> In order to solve this problem, we propose a new data structure,
>> modelled on NFSv4.1
>> ```c++
>> struct OpSlot {
>>    uint64_t seq;
>>    int r;
>>    MOSDOpReplyRef cached; // Nullable
>>    bool completed;
>> };
>> ```
>>
>> We do not want to give the OSD an unbounded obligation to hang on to
>> old message replies: that way lies madness. So, the additions to
>> `Session` we might make are:
>>
>> ```c++
>> struct Session : public RefCountedObject {
>>    …
>>    uint32_t maxslots; // The maximum number of operations this client
>>                       // may have in flight at once;
>>    std::vector<OpSlot> slots; // The vector of in-progress operations
>>    ceph::timespan slots_expire; // How long we wait to hear from a
>>                                 // client before the OSD is free to
>>                                 // drop session resources
>>    ceph::coarse_mono_time last_contact; // When (by our measure) we
>>                                         // last received an operation
>>                                         // from the client.
>> };
>> ```
>>
>> ## Message Additions ##
>>
>> The OSD needs to communicate this information to the client. The most
>> useful way to do this is with an addition to `MOSDOpReply`.
>>
>> ```c++
>> class MOSDOpReply : public Message {
>>    …
>>    uint32_t this_slot;
>>    uint64_t this_seq;
>>    uint32_t max_slot;
>>    ceph::timespan timeout;
>>    …
>> };
>> ```
>>
>> This overlaps with the function of the transaction ID, since the
>> slot/sequence/OSD triple uniquely identifies an operation. Unlike the
>> transaction ID, this provides consistent semantics and a measure of
>> flow control.
>>
>> To match our reply, the `MOSDOp` would need to be amended.
>> ```c++
>> class MOSDOp : public Message {
>>    …
>>    uint32_t this_slot;
>>    uint64_t this_seq;
>>    bool please_cache;
>>    …
>> };
>> ```
>>
>> ## Operations ##
>>
>> ### Connecting ###
>>
>> A client, upon connecting to an OSD for the first time should send a
>> `this_slot` of 0 and a `this_seq` of 0. If it reconnects to an OSD it
>> should use the `this_slot` and `this_seq` values from before it lost
>> its connection. If an OSD has state for a client and receives a
>> `(slot,seq) = (0,0)` then it should feel free to free any saved state
>> and start anew.
>>
>> ### OSD Feedback ###
>>
>> In every `MOSDOpReply` the OSD should set `this_slot` and `this_seq` to
>> the values from the `MOSDOp` to which it is replying.
>>
>> More usefully, the OSD can inform the client how many operations it is
>> allowed to send concurrently with `max_slot`. The client must **not**
>> send a slot value higher than `max_slot`. (The OSD should error if it
>> does.)
>>
>> The OSD may increase the number of operations allowed in-flight
>> if it has capacity by increasing `max_slot`. If it finds itself
>> lacking capacity, it may decrease `max_slot`. If it does, the client
>> should respect the new bound. (The OSD should feel free to free the
>> rescinded slots as soon as the client sends another `MOSDOp` with a
>> slot value equal to one on which the new `max_slot` has been sent.)
>>
>> If the client sends a `this_seq` lower than the one held for a slot by
>> the OSD, the OSD should error. If it is more than one greater than the
>> current `this_seq`, the OSD should error.
>>
>> ### Caching ###
>>
>> The client is in an excellent position to know whether it **requires**
>> the output of a previous operation of mixed reads and writes on
>> resend, or whether it merely needs the status on resend. Thus, we let
>> the client set `please_cache` to request that the OSD store a
>> reference to the sent message in the appropriate `OpSlot`.
>>
>> The OSD is in an excellent position to know how loaded it is. It can
>> calculate a bound on how large a given reply will be before executing
>> it. Thus, the OSD can send an error if the client has requested it
>> cache something larger than it feels comfortable caching.
>>
>> Assuming no errors, the behavior, for any slot, is this: If the client
>> sends an `MOSDOp` with a `this_seq` one greater than the current value
>> of `OpSlot::seq`, that represents a new operation. Increment
>> `OpSlot::seq`, clear `OpSlot::completed` and begin the operation. When
>> the operation finishes, set `OpSlot::completed`. If `please_cache` has been
>> set, store the `MOSDOpReply` in `OpSlot::cached`. Otherwise simply store the
>> result code in `OpSlot::r`.
>>
>> If the client sends an `MOSDOp` with a `this_seq` equal to
>> `OpSlot::seq` and `OpSlot::completed` is false, drop the request. (We
>> will reply when it completes.) If it has completed, send the stored
>> `OpSlot::cached` reply if there is one; otherwise send a reply
>> carrying just `OpSlot::r`.
>>
>> ### Reconnection ###
>>
>> Currently the `Session` is destroyed on reset and a new one is created
>> on authorization. In our proposed system the `Session` will not be
>> destroyed on reset; instead it will be moved to a structure where it can
>> be looked up, and destroyed only once `timeout` has passed since the last
>> message received.
>>
>> On connection, the OSD should first look up a `Session` keyed
>> on the entity name and create one if that fails.
>
> So the most common time we really get replay operations is when one of
> the OSDs crash or a PG's acting set changes for some other reason.
> Which means these "cached" operation results need to be persisted to
> disk and then cleaned up, a la the pglog.
> I don't see anything in these data structures that explains how we do
> that efficiently, which is the biggest problem and the reason we don't
> already do reply caching. Am I missing something?
>
> And do you think maybe you could split this up into a thread for each
> topic? I'm having trouble digesting it as such a wall of text. :)

Seconded! :D

> -Greg

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Ann Arbor Team's Flexible I/O Proposals (Ceph Next)
  2016-04-15 21:05 Ann Arbor Team's Flexible I/O Proposals (Ceph Next) Adam C. Emerson
  2016-04-15 21:25 ` Milosz Tanski
  2016-04-15 22:09 ` Gregory Farnum
@ 2016-04-16  0:07 ` Matt Benjamin
       [not found]   ` <CAFdRU72-CyuFRodb-HoNrBHWZRV7Xj4Ki-yHvxHPKAZeZ213Wg@mail.gmail.com>
  2016-04-18  4:00 ` Ann Arbor Team's Flexible I/O Proposals (Ceph Next) Haomai Wang
  3 siblings, 1 reply; 15+ messages in thread
From: Matt Benjamin @ 2016-04-16  0:07 UTC (permalink / raw)
  To: Adam C. Emerson; +Cc: The Sacred Order of the Squid Cybernetic



----- Original Message -----
> From: "Adam C. Emerson" <aemerson@redhat.com>
> To: "The Sacred Order of the Squid Cybernetic" <ceph-devel@vger.kernel.org>
> Sent: Friday, April 15, 2016 5:05:37 PM
> Subject: Ann Arbor Team's Flexible I/O Proposals (Ceph Next)
> 
> Ceph Developers,
> 
> We've put together a few of the main ideas from our previous work in a
> brief form that we hope people will be able to digest, consider, and
> debate. We'd also like to discuss them with you at Ceph Next this
> Tuesday.
> 

> 
> ## The OSD Set ##
> 
> The complicating case here is the OSD status set.  Running this
> through a single Paxos limits the number of OSDs that can coexist in a
> cluster.  We ought to split the set of OSDs between multiple masters to
> distribute the load. Each 'Up' or 'Down' event is independent of
> others, so all we require is that events get propagated into the
> correct OSDs and primaries and followers act as they're supposed to.
> 
> Versioning is a bigger problem here. We might have all masters
> increment their version when one increments its version if that could
> be managed without inefficiency. We might send a compound version with
> `MOSDOp`s, but combining that with the compound version above might be
> unwieldy. (Feedback on this issue would be greatly appreciated.)

When Tom Keiser and I considered the problem of distributing AFS3 data
for a single vnode across multiple data servers, iirc, we both arrived
at the notion of compound DataVersion (or "range dv") as the extension
of DataVersion to the partitioned object.

It feels like a similar structure naturally arises here, though I admit I
have not thought about this problem in a while.

Regards,

Matt

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309

^ permalink raw reply	[flat|nested] 15+ messages in thread

[parent not found: <CAFdRU72-CyuFRodb-HoNrBHWZRV7Xj4Ki-yHvxHPKAZeZ213Wg@mail.gmail.com>]

* Re: Ann Arbor Team's Flexible I/O Proposals (Ceph Next)
       [not found]   ` <CAFdRU72-CyuFRodb-HoNrBHWZRV7Xj4Ki-yHvxHPKAZeZ213Wg@mail.gmail.com>
@ 2016-04-16  0:34     ` Shinobu Kinjo
  2016-04-16  0:52       ` OSD set partitioning Adam C. Emerson
  0 siblings, 1 reply; 15+ messages in thread
From: Shinobu Kinjo @ 2016-04-16  0:34 UTC (permalink / raw)
  To: ceph-devel

> The complicating case here is the OSD status set.  Running this
> through a single Paxos limits the number of OSDs that can coexist in a
> cluster.  We ought to split the set of OSDs between multiple masters to

Do the multiple masters you mentioned here mean the logical OSDs running
in one osd process, as described in # Interfaces #?

> distribute the load. Each 'Up' or 'Down' event is independent of
> others, so all we require is that events get propagated into the
> correct OSDs and primaries and followers act as they're supposed to.
> 
> Versioning is a bigger problem here. We might have all masters
> increment their version when one increments its version if that could
> be managed without inefficiency. We might send a compound version with
> `MOSDOp`s, but combining that with the compound version above might be
> unwieldy. (Feedback on this issue would be greatly appreciated.)

Cheers,
S

-- 
Email:
shinobu@linux.com
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: OSD set partitioning
  2016-04-16  0:34     ` Shinobu Kinjo
@ 2016-04-16  0:52       ` Adam C. Emerson
  0 siblings, 0 replies; 15+ messages in thread
From: Adam C. Emerson @ 2016-04-16  0:52 UTC (permalink / raw)
  To: ceph-devel

On 15/04/2016, Shinobu Kinjo wrote:
> Do the multiple masters you mentioned here mean the logical OSDs running
> in one osd process, as described in # Interfaces #?

Those are two separate things. When I refer to 'multiple masters' I'm speaking
of the monitor service.

Currently, Ceph has a single set of monitors that acts as a group. One is the
master and the rest replicate all changes made by the master.

Here, we propose having several groups of monitors, each group with its own
master. Every group would be responsible for some subset of OSDs, determined by
partitioning the space of OSD ids up between them.

Each LogicalOSD would be tracked separately in the map. You should, in
principle, be able to back up a LogicalOSD's store, move it to another machine,
and load it into a process with a different set of LogicalOSDs the same way you
can, now, move one ceph-osd process from one physical machine to another.
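
As a rough illustration of partitioning the OSD id space between monitor groups, here is one possible scheme: cut the ids into contiguous, nearly equal ranges. This is only a sketch under assumptions; `monitor_group_for` and its parameters are hypothetical, and the actual partitioning function is not settled by anything above.

```cpp
#include <cassert>
#include <cstdint>

// Map an OSD id to the monitor group responsible for it by cutting the id
// space [0, max_osd) into `groups` contiguous, nearly equal ranges.
// Hypothetical helper: nothing like this exists in Ceph today.
inline uint32_t monitor_group_for(uint32_t osd_id, uint32_t max_osd,
                                  uint32_t groups) {
  assert(groups > 0 && groups <= max_osd && osd_id < max_osd);
  // Widen to 64 bits so the multiplication cannot overflow.
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(osd_id) * groups) / max_osd);
}
```

With 10 OSD ids and 3 groups this assigns ids 0 to 3 to group 0, ids 4 to 6 to group 1, and ids 7 to 9 to group 2.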

-- 
Senior Software Engineer           Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Ann Arbor Team's Flexible I/O Proposals (Ceph Next)
  2016-04-15 21:05 Ann Arbor Team's Flexible I/O Proposals (Ceph Next) Adam C. Emerson
                   ` (2 preceding siblings ...)
  2016-04-16  0:07 ` Matt Benjamin
@ 2016-04-18  4:00 ` Haomai Wang
  3 siblings, 0 replies; 15+ messages in thread
From: Haomai Wang @ 2016-04-18  4:00 UTC (permalink / raw)
  To: The Sacred Order of the Squid Cybernetic

Awesome .... I'm surprised that I have read the whole thing...

On Sat, Apr 16, 2016 at 5:05 AM, Adam C. Emerson <aemerson@redhat.com> wrote:
> Ceph Developers,
>
> We've put together a few of the main ideas from our previous work in a
> brief form that we hope people will be able to digest, consider, and
> debate. We'd also like to discuss them with you at Ceph Next this
> Tuesday.
>
> Thank you.
>
>
> ---8<---
>
>
> We have been looking at improvements to Ceph, particularly RADOS,
> while focusing on flexibility (allowing users to do more things)
> and performance. We have come up with a few proposals with these two
> things in mind. Sessions and read-write transactions aim to allow
> clients to batch up multiple operations in a way that is safe and
> correct, while allowing clients to gain the advantages of atomic
> read-write operations without having to lock. Sessions also provide
> a foundation for flow-control which ultimately improves performance
> by preventing an OSD from being ground into uselessness under a
> storm of impossible requests. The CLS proposal is a logical follow-on
> from the read-write proposal, as we attempt to address some problems
> of correctness that exist now and consider how to integrate the
> facility into an asynchronous world.
>
> Flexible Placement, as you would expect from the name, is about
> allowing users more control, as are Flexible Semantics. They both
> have profound performance implications, as tuning placement to better
> match a workload can increase throughput, and relaxed consistency can
> decrease latency. The proposed Interfaces are meant to support both as
> well as work currently being done to allow an asynchronous OSD and to
> hide details like locking and thread pools so that backends can be
> written with different forms of concurrency and load-balancing
> across processors.
>
> Finally, Map Partitioning is not directly related to code paths within
> the OSD itself, but does affect everything that can be done with Ceph.
> People are beginning to run into limits on how large a Ceph cluster can
> grow and how many ways they can be partitioned, and both these problems
> fundamentally derive from the way the OSD map is handled by the monitors.
>
> There are also some notes at the end. They are not critical, but if you
> find yourself asking "What were they thinking?" the notes might help.
>
> # Sessions and Read-Write #

Hmm, I can imagine the complexity of this idea... I agree with Sage's
point: if we think the current read/write path isn't fast enough, why is
the solution to introduce read/write transactions? Could we instead aim
to make read/write messages more lightweight? If we want some atomic
composed ops, we could introduce a helper interface along the lines of
CAS. I don't think the rados client really needs a complete transaction
interface.

Please correct me if I'm missing something.

I think we could keep making the normal OSD op faster, perhaps with an
OSDOp completely different from the current implementation? Besides
refactoring the I/O path for performance, we also need to consider
reducing the OSD/PG preprocessing work.

>
> From `ReplicatedPG.cc`.
>
> ```c++
> // Write operations aren't allowed to return a data payload because
> // we can't do so reliably. If the client has to resend the request
> // and it has already been applied, we will return 0 with no
> // payload.  Non-deterministic behavior is no good.  However, it is
> // possible to construct an operation that does a read, does a guard
> // check (e.g., CMPXATTR), and then a write.  Then we either succeed
> // with the write, or return a CMPXATTR and the read value.
> …
> if (ctx->op_t->empty() || result < 0) {
>   …
>   if (ctx->pending_async_reads.empty()) {
>     complete_read_ctx(result, ctx);
>   } else {
>     in_progress_async_reads.push_back(make_pair(op, ctx));
>     ctx->start_async_reads(this);
>   }
>   return;
> }
> …
> // issue replica writes
> ceph_tid_t rep_tid = osd->get_tid();
>
> RepGather *repop = new_repop(ctx, obc, rep_tid);
>
> issue_repop(repop, ctx);
> eval_repop(repop);
> ```
>
> As you can see, if we have any writes (all mutations end up in the
> `op_t` transaction), we just flat out don't do the requested read
> operations. If we don't have any writes, we perform the read
> operations and return.  This is justified in the comment above because
> of the non-deterministic behavior of resent read-write operations.
>
> This is not an unsolved problem and we can bootstrap a solution on our
> existing `Session` infrastructure.
>
> ## An upgraded session ##
>
> Behold, `OSDSession`:
> ```c++
> struct Session : public RefCountedObject {
>   EntityName entity_name;
>   OSDCap caps;
>   int64_t auid;
>   ConnectionRef con;
>   WatchConState wstate;
>   …
> };
> ```
>
> This structure exists once for every connection to the OSD. Where they
> are created depends on who is doing the creation. In the case of
> clients (what we're interested in) it occurs in `ms_verify_authorizer`:
> ```c++
> …
> isvalid = authorize_handler->verify_authorizer(cct, monc->rotating_secrets,
>                                                authorizer_data, authorizer_reply, name, global_id, caps_info, session_key, &auid);
>
> if (isvalid) {
>   Session *s = static_cast<Session *>(con->get_priv());
>   if (!s) {
>     s = new Session(cct);
>     con->set_priv(s->get());
>     s->con = con;
>     dout(10) << " new session " << s << " con=" << s->con << " addr=" << s->con->get_peer_addr() << dendl;
>   }
>
>   s->entity_name = name;
>   if (caps_info.allow_all)
>     s->caps.set_allow_all();
>   s->auid = auid;
>   …
> }
> ```
>
> In order to solve this problem, we propose a new data structure,
> modelled on NFSv4.1
> ```c++
> struct OpSlot {
>   uint64_t seq;
>   int r;
>   MOSDOpReplyRef cached; // Nullable
>   bool completed;
> };
> ```
>
> We do not want to give the OSD an unbounded obligation to hang on to
> old message replies: that way lies madness. So, the additions to
> `Session` we might make are:
>
> ```c++
> struct Session : public RefCountedObject {
>   …
>   uint32_t maxslots; // The maximum number of operations this client
>                      // may have in flight at once;
>   std::vector<OpSlot> slots; // The vector of in-progress operations
>   ceph::timespan slots_expire; // How long we wait to hear from a
>                                // client before the OSD is free to
>                                // drop session resources
>   ceph::coarse_mono_time last_contact; // When (by our measure) we
>                                        // last received an operation
>                                        // from the client.
> };
> ```
>
> ## Message Additions ##
>
> The OSD needs to communicate this information to the client. The most
> useful way to do this is with an addition to `MOSDOpReply`.
>
> ```c++
> class MOSDOpReply : public Message {
>   …
>   uint32_t this_slot;
>   uint64_t this_seq;
>   uint32_t max_slot;
>   ceph::timespan timeout;
>   …
> };
> ```
>
> This overlaps with the function of the transaction ID, since the
> slot/sequence/OSD triple uniquely identifies an operation. Unlike the
> transaction ID, this provides consistent semantics and a measure of
> flow control.
>
> To match our reply, the `MOSDOp` would need to be amended.
> ```c++
> class MOSDOp : public Message {
>   …
>   uint32_t this_slot;
>   uint64_t this_seq;
>   bool please_cache;
>   …
> };
> ```
>
> ## Operations ##
>
> ### Connecting ###
>
> A client, upon connecting to an OSD for the first time should send a
> `this_slot` of 0 and a `this_seq` of 0. If it reconnects to an OSD it
> should use the `this_slot` and `this_seq` values from before it lost
> its connection. If an OSD has state for a client and receives a
> `(slot,seq) = (0,0)` then it should feel free to free any saved state
> and start anew.
>
> ### OSD Feedback ###
>
> In every `MOSDOpReply` the OSD should set `this_slot` and `this_seq` to
> the values from the `MOSDOp` to which it is replying.
>
> More usefully, the OSD can inform the client how many operations it is
> allowed to send concurrently with `max_slot`. The client must **not**
> send a slot value higher than `max_slot`. (The OSD should error if it
> does.)
>
> The OSD may increase the number of operations allowed in-flight
> if it has capacity by increasing `max_slot`. If it finds itself
> lacking capacity, it may decrease `max_slot`. If it does, the client
> should respect the new bound. (The OSD should feel free to free the
> rescinded slots as soon as the client sends another `MOSDOp` with a
> slot value equal to one on which the new `max_slot` has been sent.)
>
> If the client sends a `this_seq` lower than the one held for a slot by
> the OSD, the OSD should error. If it is more than one greater than the
> current `this_seq`, the OSD should error.
>
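
As a rough illustration of the checks in the OSD Feedback section, here is a minimal sketch of how an OSD might classify an incoming `(slot, seq)` pair. `Verdict` and `classify` are hypothetical names, not part of Ceph, and treating an equal `seq` as a resend is taken from the Caching rules in the proposal.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class Verdict { NewOp, Replay, BadSlot, BadSeq };

// Classify an incoming (slot, seq) pair against the per-slot sequence
// numbers the OSD currently holds. `held` has one entry per slot.
inline Verdict classify(const std::vector<uint64_t>& held, uint32_t max_slot,
                        uint32_t slot, uint64_t seq) {
  // The client must not use a slot above max_slot (or one we never created).
  if (slot > max_slot || slot >= held.size()) return Verdict::BadSlot;
  const uint64_t cur = held[slot];
  if (seq == cur) return Verdict::Replay;     // resend of the current op
  if (seq == cur + 1) return Verdict::NewOp;  // next op in this slot
  return Verdict::BadSeq;                     // too old, or skipped ahead
}
```

Both error cases from the text (a `this_seq` that is too low, and one that is more than one ahead) collapse into `BadSeq` here; a real implementation might want to report them separately.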
> ### Caching ###
>
> The client is in an excellent position to know whether it **requires**
> the output of a previous operation of mixed reads and writes on
> resend, or whether it merely needs the status on resend. Thus, we let
> the client set `please_cache` to request that the OSD store a
> reference to the sent message in the appropriate `OpSlot`.
>
> The OSD is in an excellent position to know how loaded it is. It can
> calculate a bound on how large a given reply will be before executing
> it. Thus, the OSD can send an error if the client has requested it
> cache something larger than it feels comfortable caching.
>
> Assuming no errors, the behavior, for any slot, is this: If the client
> sends an `MOSDOp` with a `this_seq` one greater than the current value
> of `OpSlot::seq`, that represents a new operation. Increment
> `OpSlot::seq`, clear `OpSlot::completed` and begin the operation. When
> the operation finishes, set `OpSlot::completed`. If `please_cache` has been
> set, store the `MOSDOpReply` in `OpSlot::cached`. Otherwise simply store the
> result code in `OpSlot::r`.
>
> If the client sends an `MOSDOp` with a `this_seq` equal to
> `OpSlot::seq` and `OpSlot::completed` is false, drop the request. (We
> will reply when it completes.) If it has completed, send the stored
> `OpSlot::cached` reply if there is one; otherwise send a reply
> carrying just `OpSlot::r`.
>
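
The per-slot behavior described in the Caching section can be sketched roughly as follows. This is a toy model under assumptions: `Reply` stands in for `MOSDOpReply`, and `begin_op`, `finish_op`, and `on_resend` are hypothetical helper names. The sketch always records the result code, which is a harmless superset of what the text requires.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <string>

// Stand-in for MOSDOpReply; only the fields this sketch needs.
struct Reply {
  int r;
  std::string data;
};

struct OpSlot {
  uint64_t seq = 0;
  int r = 0;
  std::shared_ptr<const Reply> cached;  // null unless please_cache was set
  bool completed = true;                // nothing in flight initially
};

// Begin a new operation in this slot. The caller has already checked that
// the incoming this_seq is exactly slot.seq + 1.
inline void begin_op(OpSlot& slot) {
  ++slot.seq;
  slot.completed = false;
  slot.cached.reset();
}

// Record the outcome; keep the whole reply only if the client asked for it.
inline void finish_op(OpSlot& slot, const Reply& reply, bool please_cache) {
  assert(!slot.completed);
  slot.r = reply.r;
  if (please_cache) slot.cached = std::make_shared<Reply>(reply);
  slot.completed = true;
}

// Handle a resend (this_seq == slot.seq). Returns nothing while the op is
// still running, since the reply goes out on completion; otherwise returns
// the cached reply, or a bare reply carrying only the result code.
inline std::optional<Reply> on_resend(const OpSlot& slot) {
  if (!slot.completed) return std::nullopt;
  if (slot.cached) return *slot.cached;
  return Reply{slot.r, ""};
}
```

Because starting a new operation drops the previous cached reply, the OSD never holds more than one reply per slot, which is what bounds its obligation.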
> ### Reconnection ###
>
> Currently the `Session` is destroyed on reset and a new one is created
> on authorization. In our proposed system the `Session` will not be
> destroyed on reset; instead it will be moved to a structure where it can
> be looked up, and destroyed only once `timeout` has passed since the last
> message received.
>
> On connection, the OSD should first look up a `Session` keyed
> on the entity name and create one if that fails.
>
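
A minimal sketch of the reconnection handling described above, assuming a toy `Session` and a hypothetical `SessionRegistry` keyed on the entity name. It uses `std::chrono::steady_clock` in place of Ceph's own clock types, and the method names `park`, `attach`, and `reap` are made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

using Clock = std::chrono::steady_clock;

// Stand-in for the OSD's Session; only what the sketch needs.
struct Session {
  std::string entity_name;
  Clock::time_point last_contact;
};

class SessionRegistry {
 public:
  explicit SessionRegistry(Clock::duration timeout) : timeout_(timeout) {}

  // On connection reset: keep the session around instead of destroying it.
  void park(std::shared_ptr<Session> s) {
    assert(s != nullptr);
    std::string key = s->entity_name;
    parked_[key] = std::move(s);
  }

  // On authorization: reuse a parked session for this entity, else create one.
  std::shared_ptr<Session> attach(const std::string& name,
                                  Clock::time_point now) {
    std::shared_ptr<Session> s;
    auto it = parked_.find(name);
    if (it != parked_.end()) {
      s = std::move(it->second);
      parked_.erase(it);
    } else {
      s = std::make_shared<Session>();
      s->entity_name = name;
    }
    s->last_contact = now;
    return s;
  }

  // Drop parked sessions not heard from within the timeout; returns how many.
  std::size_t reap(Clock::time_point now) {
    std::size_t dropped = 0;
    for (auto it = parked_.begin(); it != parked_.end();) {
      if (now - it->second->last_contact > timeout_) {
        it = parked_.erase(it);
        ++dropped;
      } else {
        ++it;
      }
    }
    return dropped;
  }

 private:
  Clock::duration timeout_;
  std::map<std::string, std::shared_ptr<Session>> parked_;
};
```

Passing `now` in explicitly keeps the sketch deterministic; a real OSD would read its own clock and run the reaping from a timer.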
> # Read as a part of Transaction #
>
> We don't have code examples here since most of the interface
> changes are obvious. Codes and parameters would be added to
> `PGBackend::Transaction` and executing a transaction would have to
> return data.
>
> ## Motivation ##
>
> -   Mixed reads and writes are an efficiency win, since a client can
>     save round trips by batching up operations in a single request.
>     Current Ceph does not allow them for reasons which are quoted and
>     addressed in the preceding section.
> -   Mixed reads and writes are a semantic win. If an `MOSDOp` is
>     atomic (it is in current Ceph), read-after-write can often remove
>     the need for explicit locking.
> -   Transactional reads may seem complicated, but the Erasure Coding
>     backend already has to execute complex read transactions to
>     reassemble or recover data. We want an asynchronous read capability
>     in the Store anyway and there's no reason not to have it be shared
>     with our asynchronous write path.
> -   While it might seem that separating reads and writes, as we do
>     now, allows us to simplify code and rule out edge cases, we would
>     like to point out the existence of CLS, which can have problems if
>     two method calls occur in the same `MOSDOp`.
>
> ## Sketch ##
>
> The main problem with mixed read-write transactions is that replicas
> need to write but not read. The key to handling this is dependency
> checking. Outside CLS (which will be discussed below) it is very easy
> to see whether reads and writes are independent. (Simply go down the
> ops and see if their ranges overlap and whether getattrs and setattrs
> have keys in common.) Reads coming after overlapping writes depend on
> the previous writes. Then:
> -   If an op is all reads, simply do all the reads. We don't have
>     to get write locks or anything.
> -   If an op is all writes, it's no different than a replicated
>     operation now.
> -   For mixed reads and writes, if the reads aren't dependent on the
>     writes, dispatch the writes and do the reads before, after, or
>     concurrently with the writes on the primary.  (So long as we
>     prevent writes from other transactions from intervening.)
> -   Dependent reads are the difficult case. For erasure coding it
>     shouldn't make any difference since we'd have to dispatch reads and
>     writes to all stripes anyway. For replication, we would want to
>     execute the mixed read-write transaction on the local store in
>     strict order and dispatch one consisting of only writes to the
>     remotes.
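
A minimal sketch of the dependency check described above, assuming simplified stand-in types (`Op`, `Kind`, `Plan` are hypothetical, not Ceph's real `OSDOp`). It walks the ops once, treats a read as dependent only when an earlier write overlaps its byte range or attribute key, and shows the write-only transform that would go to replicas.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for OSD ops; not Ceph's real types.
enum class Kind { Read, Write, GetAttr, SetAttr };

struct Op {
  Kind kind;
  std::uint64_t off;  // byte range [off, off + len) for Read/Write
  std::uint64_t len;
  std::string key;    // attribute key for GetAttr/SetAttr
};

inline bool is_read(const Op& o) {
  return o.kind == Kind::Read || o.kind == Kind::GetAttr;
}

// True if read `r` observes something that the earlier write `w` changes.
inline bool depends_on(const Op& r, const Op& w) {
  if (r.kind == Kind::Read && w.kind == Kind::Write)
    return r.off < w.off + w.len && w.off < r.off + r.len;
  if (r.kind == Kind::GetAttr && w.kind == Kind::SetAttr)
    return r.key == w.key;
  return false;
}

enum class Plan { ReadsOnly, WritesOnly, IndependentMixed, DependentMixed };

// Classify an op vector into the four cases listed in the sketch.
Plan classify(const std::vector<Op>& ops) {
  bool reads = false, writes = false, dependent = false;
  for (std::size_t i = 0; i < ops.size(); ++i) {
    if (!is_read(ops[i])) { writes = true; continue; }
    reads = true;
    // Only writes that come *before* this read can make it dependent.
    for (std::size_t j = 0; j < i; ++j)
      if (!is_read(ops[j]) && depends_on(ops[i], ops[j])) dependent = true;
  }
  if (!writes) return Plan::ReadsOnly;
  if (!reads) return Plan::WritesOnly;
  return dependent ? Plan::DependentMixed : Plan::IndependentMixed;
}

// The transaction sent to replicas: the same ops with every read removed.
std::vector<Op> writes_only(const std::vector<Op>& ops) {
  std::vector<Op> out;
  for (const Op& o : ops)
    if (!is_read(o)) out.push_back(o);
  return out;
}
```

CLS calls are deliberately left out of this sketch, since their reads and writes are not visible from the op vector alone.
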
>
> # CLS #
>
> ## Current Problem ##
>
> The CLS API works by making an ops vector and handing it to
> `do_osd_ops`.
>
> ```c++
> int cls_cxx_getxattr(cls_method_context_t hctx, const char *name,
>                      bufferlist *outbl)
> {
>   ReplicatedPG::OpContext **pctx = (ReplicatedPG::OpContext **)hctx;
>   bufferlist name_data;
>   vector<OSDOp> nops(1);
>   OSDOp& op = nops[0];
>   int r;
>
>   op.op.op = CEPH_OSD_OP_GETXATTR;
>   op.indata.append(name);
>   op.op.xattr.name_len = strlen(name);
>   r = (*pctx)->pg->do_osd_ops(*pctx, nops);
>   if (r < 0)
>     return r;
>
>   outbl->claim(op.outdata);
>   return outbl->length();
> }
>
> int cls_cxx_setxattr(cls_method_context_t hctx, const char *name,
>                      bufferlist *inbl)
> {
>   ReplicatedPG::OpContext **pctx = (ReplicatedPG::OpContext **)hctx;
>   bufferlist name_data;
>   vector<OSDOp> nops(1);
>   OSDOp& op = nops[0];
>   int r;
>
>   op.op.op = CEPH_OSD_OP_SETXATTR;
>   op.indata.append(name);
>   op.indata.append(*inbl);
>   op.op.xattr.name_len = strlen(name);
>   op.op.xattr.value_len = inbl->length();
>   r = (*pctx)->pg->do_osd_ops(*pctx, nops);
>
>   return r;
> }
> ```
>
> The `do_osd_ops` function performs reads inline, synchronously, right
> then and there for replicated pools. (Erasure coded pools are more
> limited.) Writes are batched up and added to the transaction
> associated with the current `OpContext`.
>
> This is bad. If one has a CLS method that performs a read-modify-write
> and one calls it twice in the same `MOSDOp`, it becomes a
> read-modify-read-modify-write-write which may produce incorrect
> results.
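
To make the read-modify-read-modify-write-write hazard concrete, here is a toy model (not Ceph code; `ToyStore`, `ToyOpContext` and `cls_increment` are made up for illustration). Reads go straight to the store, writes are queued until commit, and an increment method called twice in one op loses an update.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model, not Ceph code: reads hit the store immediately, writes are
// queued on the context and applied only when the whole op commits.
struct ToyStore {
  std::map<std::string, int> attrs;
};

struct ToyOpContext {
  ToyStore& store;
  std::vector<std::pair<std::string, int>> pending;  // queued writes

  // Inline, synchronous read: never sees writes still in `pending`.
  int read(const std::string& key) const {
    auto it = store.attrs.find(key);
    return it == store.attrs.end() ? 0 : it->second;
  }
  void write(const std::string& key, int v) { pending.emplace_back(key, v); }
  void commit() {
    for (const auto& w : pending) store.attrs[w.first] = w.second;
    pending.clear();
  }
};

// A hypothetical read-modify-write CLS method.
void cls_increment(ToyOpContext& ctx, const std::string& key) {
  ctx.write(key, ctx.read(key) + 1);
}

// Run the method `calls` times inside one op and return the final value.
int run_in_one_op(int calls) {
  ToyStore store;
  ToyOpContext ctx{store, {}};
  for (int i = 0; i < calls; ++i) cls_increment(ctx, "counter");
  ctx.commit();
  return store.attrs["counter"];
}
```

In this model `run_in_one_op(2)` yields 1 rather than 2, because the second call reads the value from before the first call's queued write.
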
>
> ## Desiderata ##
>
> -   CLS operations should be composable. We should be able to have many
>     of them in a single operation.
> -   They should remain transactional. If a CLS operation does some
>     reads and hits an error, it stops and nothing is written to the
>     store. We should not allow situations where a CLS method can write
>     a partial result to the store then error.
> -   They should be capable. We should not put too many restrictions on
>     what an operation is allowed to do. It should be possible to run
>     them on Erasure Coded Pools once ECOverwrite is in place. (At
>     least some subset of them).
> -   They should be consistent. A CLS operation should be able to call
>     rand or generate a UUID without each replica holding a different
>     value. (This rules out solutions like calling the method on each
>     replica.)
> -   They should be efficient and optimizable.
> -   They should work in an asynchronous framework.
>
> There are several ways we could change their implementation to address
> these.
>
> ## Futures ##
>
> This is an attractive way to think about CLS. It allows things to
> proceed asynchronously and would solve the RMRMWW problem. One would
> simply make every I/O operation in the CLS API a call returning a
> future and write each method in continuation passing style. Executing
> the transaction in the primary OSD (on a replicated pool) would create
> a write-only transaction that could then be sent to replicas. (Having
> the execution of a CLS method also compile a write-only transaction is
> a property of any composable design.)
> -   Tracking dependencies before the operation is executed would be
>     problematic. There would be no way to know whether later reads
>     overlapped with previous writes before doing them. This could lead
>     to an unbounded obligation on the OSD to maintain state to
>     evaluate OSDOp, including potentially large writes, before
>     actually committing in order for CLS methods to remain
>     transactional.
>
> Futures are, on their own, insufficient to provide everything we need
> from CLS, largely because they are opaque to the OSD. They could be
> combined with…
>
> ## Pre-declaration ##
>
> We could remove some of the generality. A simple way of doing so would
> be to have methods declare, as a part of their signature, everything
> that they may ever read or write, with the expectation that methods
> will name the fewest resources required. This doesn't mean that every
> method will always write to and read from everything it mentions,
> merely that we have a known bound of the maximum it will ever use.
>
> This makes analysis easier for the OSD, and in the composition case,
> it could go in two passes. In the first, it would execute CLS calls
> and pre-stage results and in the second it would pass its compiled
> write transaction into the store.
>
> This is the most attractive solution, but it depends on
> pre-declaration being done well, since the declarations bound the
> resources used in pre-staging.
>
> One could make things easier by being even more restrictive and
> imposing ordering:
> 1.  The method declares in advance all read operations it might ever
>     perform.
> 2.  The method declares in advance all write operations it might ever
>     perform.
> 3.  The method examines the parameters passed by the client and
>     indicates which subset of the named inputs and outputs it will
>     use.
> 4.  The method performs its read operations and denotes exactly which
>     output operations it will perform. (not the data to be output, but
>     ranges and names.)
> 5.  The method performs write operations.
>
> The most restrictive form of this would operate in two phases. In the
> first, the CLS method would be presented with its parameters and name
> all of the things it plans to read or write (objects with ranges and
> attribute keys). In the second it would be called with the contents of
> all the reads it requested and supply the data for all the writes it
> requested.
>
> This would obviate the need for futures or other asynchronous I/O,
> and make evaluation very easy. This approach would disallow some
> operations, like indirecting through an attribute key to read another,
> but is very appealing.
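
One possible shape for the two-phase form, restricted to attribute keys for brevity. Everything here (`Declaration`, `TwoPhaseMethod`, `evaluate`) is a hypothetical illustration, not an existing CLS interface. Because `declare` sees only the client's parameters, a method cannot indirect through one attribute to pick another, which is the restriction noted above.

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>

// Hypothetical shapes for the two-phase form; none of these exist in Ceph.
struct Declaration {
  std::set<std::string> reads;   // attribute keys the method will read
  std::set<std::string> writes;  // attribute keys the method will write
};

using Values = std::map<std::string, std::string>;

struct TwoPhaseMethod {
  // Phase 1: sees only the client's parameters and names its I/O.
  std::function<Declaration(const std::string&)> declare;
  // Phase 2: sees the contents of the declared reads, returns data to write.
  std::function<Values(const std::string&, const Values&)> execute;
};

// Evaluate one method against a toy attribute store. Returns false, leaving
// the store untouched, if the method writes a key it did not declare.
bool evaluate(const TwoPhaseMethod& m, const std::string& params,
              Values& store) {
  const Declaration d = m.declare(params);
  Values inputs;
  for (const auto& k : d.reads) {
    auto it = store.find(k);
    inputs[k] = (it == store.end()) ? std::string() : it->second;
  }
  const Values outputs = m.execute(params, inputs);
  for (const auto& kv : outputs)
    if (!d.writes.count(kv.first)) return false;
  for (const auto& kv : outputs) store[kv.first] = kv.second;
  return true;
}

// Example method: copy attribute "src" to the key named by the parameters.
TwoPhaseMethod make_copy_method() {
  TwoPhaseMethod m;
  m.declare = [](const std::string& params) -> Declaration {
    Declaration d;
    d.reads.insert("src");
    d.writes.insert(params);
    return d;
  };
  m.execute = [](const std::string& params, const Values& in) -> Values {
    Values out;
    out[params] = in.at("src");
    return out;
  };
  return m;
}
```
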
>
> ## Be Transactional ##
>
> Our transactions are pre-checked and must succeed. If we want the most
> expressive version of CLS consistent with our other goals, then we
> should add commit and rollback. EC Overwrite will already require some
> form of commit and rollback, so it's not beyond the realm of thought.
>
> It could also be a foundation for some future multi-object-transaction
> supporting backend.
>
> This idea might have appeal on its own, but the concerns of CLS are
> not sufficient to motivate it.
>
> ## Domain Specific Language ##
>
> One could make a domain specific language, based on something simple,
> that the OSD can execute to perform CLS methods. The OSD could then
> analyze each method to see what I/O operations a method calls and try
> to track them.
> -   Dependency tracking for compilers is a major area of research. It
>     would be a whole lot of fun, but as a short term solution it is
>     not really practical.
> -   We still wouldn't be able to rule out problems in the general
>     case.
>
> This approach would be interesting as a long term academic research
> project, but is not suitable for a short-range improvement.
>
> # Flexible Placement #
>
> This is a large topic which should be discussed on its own, but it
> motivates the interface designs below, so we shall briefly mention why
> it's interesting.
>
> CRUSH/PG is a fine placement system for several workloads, but it has
> two well-known limitations.
>
> ## Motivation ##
>
> -   Data distribution can be much less uniform than one might like,
>     giving uneven use of disks. This has caused some Ceph developers
>     to experiment with Monte Carlo based placement algorithms.
> -   Data distribution can be much more uniform than one would
>     like. This is the fundamental cause of Ceph's slow sequential read
>     performance. More generally, unrelated workloads contend
>     with each other due to a lack of affinity for related data. The effects are
>     especially pronounced on spinning disk (due to seek times), but
>     still exist on Flash (due to bus/network contention.)  This is a
>     tension between competing goods. CRUSH gains wide dispersion and
>     uniformity to defend against correlated failures but this imposes
>     a tradeoff.
>
> ## Goal ##
>
> Ceph should support placement methods other than CRUSH/PG. Currently,
> the OSD dispatches operations based on placement group ID, which will
> need to be varied.
>
> We also need some way to get new types of functions into the cluster.
>
> ## Proposal ##
>
> Our proposal is, in a way, CRUSH taken to its logical
> conclusion. Instead of distributing CRUSH rules, we propose to
> distribute general computable functions from (oid, volume/dataset) pairs to
> sequences of OSDs with their supporting data structures.  One of our
> ongoing research projects has been an in-process executor for these
> functions based on Google's NaCl. The benefits are:
> -   Administrators can fine-tune placement functions to fit their
>     workloads well.
> -   They can also experiment easily without having to recompile all of
>     Ceph and make heavy architectural changes.
> -   Entirely new placement strategies can be deployed without having
>     to upgrade every machine in the cluster. Or any machine in the
>     cluster, once they've been upgraded to a Flexible Placement
>     capable version.
> -   Possibilities for annealing and machine learning to gradually
>     adapt placement in response to load data become available
> -   NaCl builds on LLVM which has a rich set of tools for optimizations
>     like partial evaluation.
> -   NaCl is fast.
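
For readers who want to picture the signature: a placement function in the sense above is just a map from an (oid, volume) pair to an ordered list of OSDs. The sketch below uses a plain C++ closure rather than a NaCl-sandboxed function, and `PlacementFn` and `make_hash_placement` are illustrative names only. The hash-and-wrap strategy is an arbitrary stand-in, not a proposed algorithm.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using OsdId = int;

// A placement function maps (oid, volume) to an ordered sequence of OSDs.
using PlacementFn = std::function<std::vector<OsdId>(
    const std::string& oid, const std::string& volume)>;

// Hypothetical example: choose `replicas` consecutive OSDs from a fixed
// list, starting at a position derived from a hash of volume and oid.
PlacementFn make_hash_placement(std::vector<OsdId> osds,
                                std::size_t replicas) {
  return [osds, replicas](const std::string& oid,
                          const std::string& volume) -> std::vector<OsdId> {
    std::vector<OsdId> out;
    if (osds.empty()) return out;
    const std::size_t n = replicas < osds.size() ? replicas : osds.size();
    const std::size_t start =
        std::hash<std::string>{}(volume + "/" + oid) % osds.size();
    for (std::size_t i = 0; i < n; ++i)
      out.push_back(osds[(start + i) % osds.size()]);
    return out;
  };
}
```
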
>
> # Flexible Semantics #
>
> Another motivating example. Originally, Ceph did replication and only
> replication under a very specific consistency model. There has been
> desire for more flexibility.
> -   Erasure Coding. It still follows the Ceph consistency model
>     (though leaves out many operations) but is very different in
>     back-end dispatch, enough so that it inspired a major rewrite of
>     the OSD's bottom half.
> -   Append-only immutable objects have been discussed.
> -   Many people have asked for relaxed consistency to improve
>     performance. This would not be suitable for all workloads, but people
>     have repeatedly asked for the ability to set up low-latency,
>     relaxed-consistency volumes that still provide Ceph's ability to
>     easily use new storage and scale well.
> -   Transactional storage. As mentioned above, cross-object
>     transactional semantics are a thing people may have desired.
>
> # Interfaces #
>
> Right now our class hierarchy is a bit of a mess. Eventually we'll do
> something about `PG` and `ReplicatedPG`, refactor, support
> asynchronous I/O, reduce lock contention, support in core affinity,
> and build Jerusalem here in England's green and pleasant land.
>
> While we're stringing up our bows of burning gold, we should support
> non-PG based placement and flexible semantics. Right now, parts of the
> PG and the OSD (since the OSD manages the collection of PGs, spins
> them up, and manages thread pools shared by sets of PGs) are
> intertwined. Thus, we need to abstract out both pieces.
>
> As we also want to support having multiple "logical" OSDs running in a
> single `ceph-osd` process, this would be a natural time to add that
> capability.
>
> Both these are sketches and should be considered a work in progress.
>
> ## `DataSetInterface` ##
>
> Here is a sketch of what a flexible abstraction based on PG could look
> like, at least parts of one. Not being informed about Scrub,
> Recovery, or Cache Tiering, having only focused on the object
> operation path, we won't include those details here.
>
> We also leave out functions called from the PG itself or other objects
> invoked from down the stack.
>
> ```c++
> class DataSetInterface {
> protected:
>   LogicalOSD& losd; // LogicalOSD is a means to have different
>                     // stores/semantics run in the same process.
>
>   MapPartRef curmap; // Subset of map relevant to this DSI
> public:
>   // The OSD (things Up the Stack, generally) should not call 'lock'
>   // on us. If we have locking of some sort things down the stack that
>   // we have some relationship with (friend or whatever) could lock or
>   // unlock us, but that should not be baked in as part of the interface.
>
>   // Things like the info struct and details about loading the Place
>   // wouldn't actually be here. As there is an intimate relation
>   // between the LogicalOSD and an implementation of DataSetInterface (it
>   // holds all those loaded in memory and controls dispatch), they
>   // would not need to be part of the generic interface.
>
>   const coll_t coll; // The subdivision of the Store we control
>
>   // In the PG case we always know we're the primary or not for
>   // anything within the same pgid. That is not expected to be the
>   // case generally.
>   virtual bool is_primary(const OpRequest&) = 0;
>   // No 'is_replica' since 'replica' may not be applicable
>   // generally. It's a bit off even in the erasure coded case.
>   virtual bool is_acting(const OpRequest&) = 0;
>   virtual bool is_inactive() = 0;
>
>  public:
>   // No identifier. The descendent will take that.
>   DataSetInterface(LogicalOSD& o, OSDMapRef curmap);
>   virtual ~DataSetInterface();
>
>   DataSetInterface(const DataSetInterface&) = delete;
>   DataSetInterface& operator =(const DataSetInterface&) = delete;
>   DataSetInterface(DataSetInterface&&) = delete;
>   DataSetInterface& operator =(DataSetInterface&&) = delete;
>
>   virtual void on_removal(ObjectStore::Transaction *t) = 0;
>
>   // Yes, there's no 'queue' and no 'do_op' or any of
>   // that. This is intentional. There's no dequeue or do_op because
>   // those functions are either called only by the PG currently OR
>   // they're called in OSD functions called by the PG as part of the
>   // thread switch. They should not be part of the public interface.
>
>   // There's no queue because we can either put queue here or we can
>   // put queue in LogicalOSD. (We could do both, but that seems bad to
>   // me.) If there is some combination of locking and checking that
>   // must be done before queueing an operation, it seems that it's
>   // better to do it in LogicalOSD so that it doesn't leak out and
>   // become part of the abstraction for other implementations.
> };
> ```
>
> ## `LogicalOSD` ##
>
> The OSD class itself (representing the single OSD process) should have
> a map (*perhaps* a Boost.Intrusive.Set?) mapping OSD IDs to
> `LogicalOSD` instances.
>
> ```c++
> class LogicalOSD {
>   OSD& osd;
>   ObjectStore& store;
>
>   // Look up the DataSetInterface instance appropriate to the given
>   // OpRequest.
>   virtual future<DataSetInterface,int> get_place_for(const OpRequest&) = 0;
>
>   // Every logical OSD will have its own watchers as well as slot
>   // cache. Someone familiar with flow control should check this
>   // idea. Since LogicalOSDs will, ideally, share messengers we might
>   // want them to share the same slot cache. In that case we should
>   // just re-dimension watchers within Session
>   SessionRef session_for(const entity_name_t& name);
>
>   void queue(DataSetInterfaceRef&& pi, OpRequestRef&& to_queue);
>   void queue_front(DataSetInterfaceRef&& pi, OpRequestRef&& to_queue);
>
>   // Dequeue and the like are currently called in the PG itself and so
>   // have no place in the interface presented to the OSD.
>
>   void pause();
>   void resume();
>   void drain();
> };
> ```
>
> ## Library ##
>
> Both these interfaces are quite thin and intentionally so. Scrubbing
> and recovery have not been addressed at all, as mentioned, so those
> parts will be expanded.  Asynchrony should allow us simpler interfaces
> since some complexity of requeueing will be handled by futures and
> continuations.
>
> We obviously do not want to rewrite all our existing code. Instead
> most of the existing work on `PG` and `ReplicatedPG` should be
> refactored into a templated library from which implementations of
> `LogicalOSD` and `DataSetInterface` can be constructed.
>
> # Map Partitioning #
>
> There are two huge problems with scalability in Ceph.
> 1.  The OSDMap knows too many things
> 2.  A single monitor manages all updates of everything and replicates them to
>     other monitors.
>
> ## Too Big to Not Fail ##
>
> The monitor map and MDS maps are fine. Each holds data needed to
> locate servers and that's it. It would be very hard to put enough data
> in them to cause problems. The OSD map however contains a trove of data that
> must be updated serially in Paxos and propagated to every OSD,
> monitor, MDS, and client in the cluster.
>
> Pools are a notorious example. We can't create as many pools as users
> would like. Pools are heavyweight, and while they depend on other
> items in the OSD map (like erasure code profiles), it would be nice if
> we divided them between several monitor clusters, each of which would
> hold a subset of pools. We would need to make sure that clients had up
> to date versions of whatever pools they are using along with the
> status of the OSDs they're speaking to, but that's not
> impossible. Likewise, we should split placement rules out of the OSD
> map, especially once we get into larger numbers of potentially larger
> Flexible Placement style functions.
>
> Nodes should then only need to subscribe to the set of pools and
> placement functions they need to access their data. Changes like these
> should allow users to create the number of pools they want without
> causing the cluster difficulty.
>
> ### Consistency ###
>
> Partitioning makes consistency harder. A simple remedy might be to
> stop referring to data by name or integer. An erasure code profile
> should be specified by UUID and version. So should pools and placement
> functions. When sending a request to the OSD, a client should send the
> versions of the pool, the ruleset, and the OSDMap it used and the OSD
> should check that all three are current.
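
A small sketch of the three-way version check, assuming monotonically increasing integer versions. `MapVersions` and `Check` are hypothetical names, and distinguishing a stale client from a stale OSD is an added assumption; the text only says the OSD checks that all three are current.

```cpp
#include <cstdint>

// Hypothetical compound version a client attaches to each request.
struct MapVersions {
  std::uint64_t pool;
  std::uint64_t ruleset;
  std::uint64_t osdmap;
};

enum class Check { Current, ClientStale, OsdStale };

// Compare what the client used against what the OSD currently holds.
// A request is current only if all three versions match exactly.
Check check_versions(const MapVersions& client, const MapVersions& osd) {
  const bool client_behind = client.pool < osd.pool ||
                             client.ruleset < osd.ruleset ||
                             client.osdmap < osd.osdmap;
  const bool osd_behind = client.pool > osd.pool ||
                          client.ruleset > osd.ruleset ||
                          client.osdmap > osd.osdmap;
  if (client_behind) return Check::ClientStale;  // client must refresh
  if (osd_behind) return Check::OsdStale;        // OSD must catch up first
  return Check::Current;
}
```
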
>
> ## The OSD Set ##
>
> The complicating case here is the OSD status set.  Running this
> through a single Paxos limits the number of OSDs that can coexist in a
> cluster.  We ought to split the set of OSDs between multiple masters to
> distribute the load. Each 'Up' or 'Down' event is independent of
> others, so all we require is that events get propagated into the
> correct OSDs and primaries and followers act as they're supposed to.
>
> Versioning is a bigger problem here. We might have all masters
> increment their version when one increments its version if that could
> be managed without inefficiency. We might send a compound version with
> `MOSDOp`s, but combining that with the compound version above might be
> unwieldy. (Feedback on this issue would be greatly appreciated.)
>
> ### Subscription ###
>
> For a large number of OSDs, it would be nice if not everyone were
> notified of all state changes.
>
> For a pool whose placement rule spans only a subset of all OSDs,
> clients using that pool should be able to subscribe to a subset of the
> OSD set corresponding to that pool. This should be fairly easy so long
> as the subset is explicit.
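
With an explicit subset, the monitor-side bookkeeping could be as small as the following sketch. `SubscriptionTable` and `OsdEvent` are hypothetical names: each client registers the OSD ids it cares about, and an Up/Down event goes only to clients whose subset contains that OSD.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// One Up/Down state change for a single OSD.
struct OsdEvent {
  int osd;
  bool up;
};

// Hypothetical monitor-side table: which OSD ids each client subscribed to.
class SubscriptionTable {
  std::map<std::string, std::set<int>> subs_;

 public:
  void subscribe(const std::string& client, const std::set<int>& osds) {
    subs_[client].insert(osds.begin(), osds.end());
  }

  // Clients that should be told about this event, in name order.
  std::vector<std::string> recipients(const OsdEvent& ev) const {
    std::vector<std::string> out;
    for (const auto& s : subs_)
      if (s.second.count(ev.osd)) out.push_back(s.first);
    return out;
  }
};
```
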
>
> In the case of pools not providing an explicit subset, a monitor (or
> perhaps a proxy in front of a set of monitors) could look at common
> patterns of subscription requests and merge those with significant
> overlap together, so as to give clients a subset without being
> destroyed by the irresistible force of combinatorial explosion.
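
One way to picture the merging step, purely as an assumption since the text does not name a method: measure overlap between requested OSD sets with Jaccard similarity and greedily fold each request into the first existing group that is similar enough. The names and the threshold are illustrative.

```cpp
#include <cstddef>
#include <set>
#include <vector>

using OsdSet = std::set<int>;

// Jaccard similarity: size of the intersection over size of the union.
double overlap(const OsdSet& a, const OsdSet& b) {
  if (a.empty() && b.empty()) return 1.0;
  std::size_t common = 0;
  for (int x : a) common += b.count(x);
  return static_cast<double>(common) /
         static_cast<double>(a.size() + b.size() - common);
}

// Greedy merge: fold each request into the first group whose overlap with
// it is at least `threshold`; otherwise the request starts a new group.
std::vector<OsdSet> merge_requests(const std::vector<OsdSet>& reqs,
                                   double threshold) {
  std::vector<OsdSet> groups;
  for (const OsdSet& r : reqs) {
    bool merged = false;
    for (OsdSet& g : groups) {
      if (overlap(g, r) >= threshold) {
        g.insert(r.begin(), r.end());
        merged = true;
        break;
      }
    }
    if (!merged) groups.push_back(r);
  }
  return groups;
}
```

The number of groups is never larger than the number of distinct requests, and a higher threshold trades fewer wasted notifications for more groups to track.
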
>
> # Notes #
>
> These are notes taken when reviewing the code and thinking out
> ideas. You don't have to read them, but they are provided as a
> supplement in case you wanted to know what we were thinking and why.
>
> ## ShardedOpWQ ##
>
> -   What is the purpose of `sdata_op_ordering_lock`? A shard is not a
>     PG, so why do things need to be ordered within shards as well as
>     within PGs?
> -   `sdata_lock` pairs up with the condition variable
>
> ## OSD Upper Half ##
>
> ### Regular Dispatch ###
>
> -   Does not overlap with `fast_dispatch`. Operations in
>     `ms_can_fast_dispatch` are not handled in `_dispatch` and vice versa.
> -   Lock the entire OSD
> -   If another dispatch is executing, go to sleep and wait for it to
>     finish. What the heck?
> -   Do Waiters
>     *   Waiters are a list of `OpRequestRef`s called `finished` for some
>         reason
>     *   Whenever we activate an `OSDMap` the requests waiting for the
>         map get put onto `finished`
> -   Call `_dispatch`
> -   Do some more waiters
> -   Wake up other dispatch threads
> -   Unlock the entire OSD
>
> #### `_dispatch`? ####
>
> A giant case statement that does a bunch of things.
>
> In the case of `OSDOp`, if we have an `OSDMap`, create an `OpRequest` and
> pass it to `dispatch_op`. This is for things like PG commands, not
> actual object operations.
>
> #### `dispatch_op` ####
>
> Another giant case statement.
>
> ### Fast Dispatch ###
>
> #### `ms_fast_preprocess` ####
>
> Update the map epoch if an OSD sends us an OSDMap.
>
> #### `ms_fast_dispatch` ####
>
> -   Make an `OpRequest`
> -   A bit weird and convoluted, it looks like we use the 'op waiting
>     for map' stuff to queue up an op on a reserved map and remove the
>     reservation preventing it from running before we return.
> -   Specifically we mark the op as waiting for its PG in the `Session`
>     and then mark the `Session` as waiting for the new map.
> -   Ultimately things end up in `dispatch_op_fast`
>
> #### `dispatch_op_fast` ####
>
> Shovels operations into type specific calls like…
>
> #### `handle_op` ####
>
> -   Set up map share (if needed)
> -   Calculate the True PGID and Pool (sanity check against the client?)
> -   Either get the pointer to the PG (a base class) or, if it hasn't
>     been loaded in, queue the session to wait for it
> -   If we have the PG, `enqueue_op` (which just calls `PG::queue_op`)
>
> ## OSD Lower Half (Currently PG) ##
>
> `ReplicatedPG` and `PG` are separate for historical reasons and actual
> differentiation occurs in choice of backend according to Word of Sam.
>
> PGs with different consistency properties are explicitly a goal
> now. The idea of a `PGInterface` has been floated to facilitate their
> creation and `ReplicatedPG` would become a child of that.
>
> ### `PG::queue_op` ###
>
> -   Delay if other people are waiting for maps (to preserve the PG Ordering)
> -   Enqueue in `op_wq` (owned by the OSD)
> -   (Why call into the PG then? Just to enforce the map ordering?)
> -   The work queue gathers operations which, during `_process` are later
>     reassembled into a list of work to be done.
> -   `_process` is called by a worker thread in the thread pool, so the
>     call to `dequeue_op` is in a worker thread. Since it's sharded, we
>     get multiple groups of threads each serving some subset of PGs.
>
> ### `OSD::dequeue_op` ###
>
> -   After a bit of fiddling about sharing maps, call `PG::do_op`
>
> ### `PG::do_op` ###
>
> -   Sam says he plans to rewrite this to allow read asynchrony
> -   We want to see reads and writes share the same transaction
>     structure and similar semantics.
> -   We also want to allow reads and writes in the same operation and
>     to use a session mechanism to allow that.
> -   We'll need transaction transforms to, for example, filter out
>     reads before sending an operation to a replicating OSD. This
>     shouldn't be too hard, since the output of read operations can't
>     be the input for write operations. (Except in CLS?)
> -   `do_op` is a virtual function, but the only implementation is in
>     ReplicatedPG.
> -   Here looks to be where we apply ordering to Writes
> -   `execute_ctx` actually performs the operations after `do_op` has set
>     everything up
>
> ### `execute_ctx` ###
>
> -   May be called multiple times on the same `OpContext`
> -   In the case of clone operations (that's the only thing that takes
>     `src_obc`?), get a read-lock for the object context
> -   It's called `ondisk` but I'm not sure why, it doesn't look like
>     they get serialized
> -   Then we have a brief detour into `prepare_transaction`
> -   Here's the read-write restriction. ReplicatedPG.cc:2975. Later we can
>     create a better session abstraction to fix that.
> -   For reads
>     *   `do_osd_op_effects`!
>     *   If all our reads were synchronous (or there were none)
>         `complete_read_ctx`, which creates and sends the reply
>     *   Otherwise, `start_async_reads`, which passes the pending reads off
>         to `objects_read_async`
>     *   Once the backend completes, we go to `finish_read`, which calls
>         `complete_read_ctx`
> -   Trim the PG Log
> -   Hey, cool, there's a lambda! Register an `onack` closure that sends a reply
> -   And `oncommit`. And `onsuccess`. And `finish`.
> -   Package up the `OpContext` and its transaction and whatnot into a
>     `RepOp`. This is where all the mutations get done.
> -   Call `issue_repop`
> -   Call `eval_repop`
> -   Adam really wishes we would use `boost::intrusive_ptr` everywhere
>     and stop using explicit gets and puts.
>
> ### `prepare_transaction` ###
>
> -   `do_osd_ops`!
> -   If we're not full, `finish_ctx`.
>
> ### `do_osd_ops` ###
>
> -   Loop over the ops in a gigantic case statement
> -   If we hit any modification ops set the `user_modify` flag. This is
>     used to update the object version as part of the transaction
> -   On EC pools, do reads asynchronously, pushing them onto a list of
>     reads to complete.
> -   Otherwise do the reads synchronously
> -   CLS calls can be tricky since they read or write depending on the
>     method invoked
> -   It looks like operations performed by CLS are done by calling each
>     operation individually with `do_osd_ops` with reads being done
>     immediately and writes being queued up as part of the transaction
> -   Making the CLS API a futures-based interface may be a good thing to do.
> -   Cache ops like flushing seem to be about shovelling triggers to
>     perform actions into the `onack`/`oncomplete` lists.
> -   For write operations, stuff them into the Transaction
> -   In the case of CLS operations which do both reads and writes
>     (which some of them do), it appears that putting two CLS operations
>     in the same OSDOp might lead to weird results since all the reads
>     will happen then all the writes.
>
> ### `finish_ctx` ###
>
> -   Fiddle with object state and logs to update snapshot foo and to make
>     sure the object exists in the form we need it
> -   Update user version if we modified the object
> -   Save the updated `object_info_t`
> -   Append the updated object info to the `PGLog`
> -   Apply context stats
>
> ### `do_osd_op_effects` ###
>
> -   Add watches if we need to add watches
> -   If there's notifies, notify the watchers
> -   Why do we ack notifies?
>
> ### `issue_repop` ###
>
> -   Acquire locks (I'm still not clear why they're called `ondisk`. Is
>     it a lock acquired to use the store and thus it locks the on-disk
>     representation?)
> -   Apply built up attributes (likely versions and things that had been
>     stuck in the PGLog before.)
> -   Submit transaction to the PG Backend. Which is where it gets
>     divided up for Erasure Coding or sent out for replication. I'll
>     count that as Bottom End for the moment alongside the Store.
>     Changes to the backend will be for new consistency models.
>
>     We might be able to get a separation of concerns by varying what
>     is now ReplicatedPG to support different 'gridding' of objects on
>     the OSD and rejigger things so the consistency model is purely a
>     property of the backend. That's appealing from a maintenance
>     perspective, but breaks down if we want things like explicitly
>     marked transactions across multiple objects for some volumes while not
>     paying for them on others. It might not be workable in the general
>     case.
> -   That's also where local application takes place.
>
> ### `eval_repop` ###
>
> -   This function just sends notifications and cleans up when we finish.
> -   Its name is not very appropriate for what it does.
> -   If we're already done, return.
>     *   This isn't bad, but it's specifically necessary because `eval_repop`
>         gets called from several places including the handlers for our
>         subservient OSDs completing an operation.
> -   If everyone's ack'd, fire off our ack handlers. If everyone's
>     completed, fire off our completion handlers.
> -   Notify anyone waiting for the version we've committed…
> -   And for those waiting on the one we've applied
> -   If we've done everything, update usage stats
>     *   Fire off `on_success` callbacks
>     *   Remove ourselves
>
> ## Flex Points ##
>
> ### PlacementGroup/FlexiblePlacement/OtherConsistencyStrategy ###
>
> -   Fast Dispatch currently shoves requests into a PG.
> -   `handle_op` calculates a pgid and actually gets the pointer to or
>     queues the session to wait on the associated PG
> -   If we implement `queue_op` in FlexiblePlacement we can do whatever we
>     want with it. We can ignore the WorkQueue.
> -   Much of the code in `ReplicatedPG` is useful even with other
>     semantic models than PG-ordered replication
> -   We might want to make `ReplicatedPG` a template and
>     supply the `PG` specific parts as a class instantiation. Then we
>     could create more classes for other partition/dispatch models.
> -   We will want a consistency/semantic variation orthogonal to the
>     partition/dispatch model.
>     *   In this divide, splitting objects into PGs, where all
>         operations are dispatched into the PG for whatever object they
>         affect, would be partition/gridding
>     *   Whereas the total ordering on PG operations and constraints on
>         when a request blocks versus being served are the
>         consistency/semantics
>
> ### Allocation/Locking/Dispatch ###
>
> -   `OpRequest` (currently allocated in `do_op`) and other structures
>     might be allocated at various points. In our earlier prototype we
>     allocated `OpRequest` and another structure alongside the `MOSDOp` and
>     reused `MOSDOp`s rather than deallocating them to cut down on
>     allocator use in the fast path.
>
>     That might fight with also promising designs using core-affine
>     memory management, unless we can determine core affinity quickly
>     before allocating the message. (Maybe peeking into the undecoded
>     bytes?)
> -   Lock freedom should be orthogonal to flexible placement. There may
>     be situations where we want lockful systems in flexible placement
>     (since flexible placement can have a variety of sync behaviors.)
>     and we know that Sam and others are interested in pursuing
>     lock-free designs in PG-placement.
> -   In a lock-free design, if PGs are core-affine,
>     `enqueue_op` could just submit a message to a core without locking
>     or some of the thread/worker complexity.
> -   For Volumes, where the volume itself may be partitioned across cores
>     `enqueue_op` would have to look at the object name to find its target.
> -   Thus, we would want to pull that logic into a separate function
>     giving our dispatch target.
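
A sketch of what a separate dispatch-target function might look like, assuming a fixed number of cores and a hypothetical `OpHeader` carrying just the routing fields. For PG placement the PG itself is core-affine, while for volumes the object name has to be consulted.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical request header: just enough information to pick a core.
struct OpHeader {
  bool volume_placement;  // true for volume-style placement, false for PG
  std::uint32_t pg_seed;  // used when routing by PG
  std::string oid;        // used when routing by object name
};

// Return the index of the core that should receive this operation.
std::size_t dispatch_target(const OpHeader& h, std::size_t ncores) {
  if (ncores == 0) return 0;  // degenerate case: nothing to choose from
  if (!h.volume_placement)
    return h.pg_seed % ncores;  // every op for a PG lands on one core
  return std::hash<std::string>{}(h.oid) % ncores;  // volume spans cores
}
```
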
>
> ### Read-Write Symmetry ###
>
> -   Thankfully, `init_op_flags` is happy to set both read and write
> -   CLS in particular falls afoul of this. Futures might be the best
>     way to deal with it.
>
> ### Things we know we had to do anyway from previous work ###
>
> -   Use `std::map` less as a parameter/return type, same for `std::set`
> -   Objecter improvements
>     *   Less allocation, change data structures. A dual to some of the
>         work we want to do to make the EC interface less memory
>         intensive.
>     *   If we have zero copy there should be a way to materialize that
>         at the level of the client.
> -   See about bootstrapping client-side EC from EC overwrite
> -   Librados4 should be more like Objecter than it is like librados3
>
> ## Sam and `do_op` (♪ Doo-Wop? ♪) ##
>
> ### Discussion ###
>
> Notes taken during a BlueJeans call between Adam Emerson and Sam
> Just. (Sorry for any mistakes, recording a conversation while having
> it is tricky.)
>
> -   We should never have to block for I/O
> -   It's not `do_op` per se, though we are rewriting that to put it into a
>     continuation passing style with trampolines
> -   Various bits should be allowed to block, but whether they do or
>     don't should not affect the caller's code-flow.
> -   Once we've got to that point, everything after is easier
> -   We have to make sure we don't introduce so much overhead that it's measurable
> -   Eventually plans to go to a lock-free/sharded/partitioned style like Seastar
> -   We are not using Seastar's system because, when you fulfill a
>     promise, you don't want to be required to fulfill it in that
>     thread; it should be easy to fulfill it in a different thread.
> -   Also adapting an existing codebase to Seastar is much harder than
>     writing one from scratch to use it.
> -   It should also allow us to run all the OSDs in the same process
> -   We might want to have one messenger per logical OSD and have those share
>     threads (loses some efficiency gains but is backwards compatible.)
> -   These sorts of changes will also make EC overwrites much easier.
> -   Any refactors in the code should move us in this direction as a side effect
> -   The sooner the better, so if it does cause performance problems we
>     can find out soon
> -   Branch is wip-do-op in athanatos

This part aligns with the work I have been struggling with.
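
To make the "continuation passing style with trampolines" item in the notes above concrete, here is a small, generic sketch. It is not taken from the `wip-do-op` branch. `Trampoline`, `sum_to`, and `run_sum` are hypothetical names, and the arithmetic stands in for real work. Each step passes its result to a continuation instead of returning it, and the trampoline runs queued steps from a flat loop so that long chains do not grow the stack.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical trampoline: steps are queued and executed from a flat loop.
class Trampoline {
 public:
  using Step = std::function<void()>;

  void post(Step s) { queue_.push_back(std::move(s)); }

  // Runs queued steps until none remain; steps may post further steps.
  void run() {
    while (!queue_.empty()) {
      Step s = std::move(queue_.front());
      queue_.pop_front();
      s();
    }
  }

 private:
  std::deque<Step> queue_;
};

// Continuation-passing style: the result goes to `k` rather than being
// returned. Each recursive step is bounced through the trampoline.
inline void sum_to(Trampoline& t, std::int64_t n, std::int64_t acc,
                   std::function<void(std::int64_t)> k) {
  if (n == 0) {
    k(acc);
    return;
  }
  t.post([&t, n, acc, k = std::move(k)] { sum_to(t, n - 1, acc + n, k); });
}

// Drives the whole chain and collects the final value.
inline std::int64_t run_sum(std::int64_t n) {
  Trampoline t;
  std::int64_t result = -1;
  sum_to(t, n, 0, [&result](std::int64_t v) { result = v; });
  t.run();
  return result;
}
```

The caller of `sum_to` cannot tell whether a step completed inline or was deferred, which is the property the notes ask for: whether something blocks should not change the caller's code flow.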

>
> ### Brief Exploration of the code ###
>
> Adam Emerson looked briefly through the `wip-do-op` branch in
> `https://github.com/athanatos/ceph.git` to see what the general design
> looked like and how it matched up with our goals.
>
> -   Getting rid of the 'ondisk lock' looks good, someone good at
>     scheduling (Matt?) should review the queue. It should not use
>     `std::list`, though.
> -   The `do_replica_safe_reads` refactor isn't bad but doesn't seem to
>     have an immediate effect. Sam described it as providing safety by
>     shunting things replicas could do into their own function, so it
>     should make future development and refactoring easier.
> -   It reinforces the idea that reads inhabit a separate magisterium
>     with its own law and dispensation from writes and is the opposite
>     direction from the read/write transactions we want. At least
>     potentially, we could use it as a fast/safe path and have it do a
>     more specialized transaction dispatch for reads, maybe.
> -   The `do_op`/`do_replica_op` split seems reasonable for the
>     replicated case, since in that one we want to transform the
>     transaction before sending it to the replicas. If we want to allow
>     CLS methods on EC pools (which we do, in principle) or mixed
>     read-write, then the distinction between primary and replica might
>     break down.
> -   Not sure if the error channel is better per se, but since we
>     currently have a bunch of functions that return `int` to indicate
>     errors, it might be easier to integrate.
> -   C++ should have a `void` type a bit more like unit so you could
>     explicitly return `void()` from void functions. You'd think they
>     could put *that* in C++17 since their list of things to add to the
>     standard now consists entirely of "3 to the version number".
> -   The `future` implementation looks promising. I'll need to review
>     how it's put together in more detail later; how it's used is more
>     pertinent at the moment.
> -   Things make sense from a gradualist position. Given the desire for
>     a progression from here to _A Really Fast OSD_ where we have
>     _A Working OSD_ at every point along the way, this approach makes
>     sense. Restructuring everything around a blocking-agnostic futures
>     design then opens the way to introducing asynchronous, lock-free code.
> -   This is also compatible with flexing, since we can have multiple
>     `LogicalOSD` implementations with different locking strategies or
>     core affinity.
> -   `aio_read` looks to be less aio than the name would suggest. This
>     isn't bad, it's reasonable to do a transform by having things call
>     blocking procedures in a way that will work if they become non-blocking.
> -   Reimplementing the blocking calls in terms of nonblocking calls is
>     smart.
> -   `OSDReactor` looks like it could be adapted, at least the public
>     interface, into LogicalOSD once we made it less PG specific.
> -   In principle it's a good idea. A LogicalOSD would have to be bound
>     closely to the DataSetInterface it worked with since they're two
>     halves of a queueing mechanism.
> -   The futures stuff definitely isn't naïve. We need to understand
>     the blockers and other details.  The idea of having a future yield
>     when it needs to wait for something is a good one.
> -   It uses `std::list` though.
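
As an illustration of "reimplementing the blocking calls in terms of nonblocking calls", the following single-threaded sketch shows one possible shape. It does not reflect the branch's actual `future` or `OSDReactor` code. `Reactor`, `read_async`, and `read_blocking` are hypothetical names, and the "read" just fabricates a string.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for an event loop that owns pending completions.
class Reactor {
 public:
  void post(std::function<void()> f) { pending_.push_back(std::move(f)); }

  // Runs one queued task; returns false when nothing is pending.
  bool run_one() {
    if (pending_.empty()) return false;
    std::function<void()> f = std::move(pending_.front());
    pending_.pop_front();
    f();
    return true;
  }

 private:
  std::deque<std::function<void()>> pending_;
};

// Nonblocking primitive: returns immediately and reports via callback.
inline void read_async(Reactor& r, std::string oid,
                       std::function<void(std::string)> on_done) {
  r.post([oid = std::move(oid), on_done = std::move(on_done)] {
    on_done("data:" + oid);  // placeholder for the real read result
  });
}

// Blocking call expressed through the nonblocking one: start the
// operation, then drive the reactor until the result has arrived.
inline std::string read_blocking(Reactor& r, const std::string& oid) {
  bool done = false;
  std::string result;
  read_async(r, oid, [&done, &result](std::string v) {
    result = std::move(v);
    done = true;
  });
  while (!done && r.run_one()) {
  }
  assert(done);
  return result;
}
```

Existing callers can keep using `read_blocking` while new code moves to `read_async`, so the blocking interface becomes a thin layer that can be dropped later.
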
>
> ## Why librados is not wonderful ##
>
> Not that we hate RADOS, we just like Objecter way more
> -   Does not support read and write in same op. Neither does RADOS, to
>     be fair, but we plan to fix that.
> -   Takes a giant lock with every operation. Yuck.
> -   Has its own 'callback' interface
> -   Its handling of asynchronous operations seems very heavyweight and
>     not natural.
> -   Hides the internal structure of RADOS operations
> -   Does not expose object locator in a useful way
> -   Does way too many allocations
> -   The dimensioning of the interface is weird, like binding the IoCtx
>     to a pool
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat


end of thread, other threads:[~2016-04-18 17:19 UTC | newest]

Thread overview: 15+ messages
2016-04-15 21:05 Ann Arbor Team's Flexible I/O Proposals (Ceph Next) Adam C. Emerson
2016-04-15 21:25 ` Milosz Tanski
2016-04-15 22:12   ` Adam C. Emerson
2016-04-18 17:19     ` Milosz Tanski
2016-04-15 22:09 ` Gregory Farnum
2016-04-15 22:29   ` Sessions and Persistence Adam C. Emerson
2016-04-15 22:38     ` Gregory Farnum
2016-04-15 22:44       ` Matt Benjamin
2016-04-16  3:03       ` Adam C. Emerson
2016-04-18  0:56     ` Sage Weil
2016-04-16  0:03   ` Ann Arbor Team's Flexible I/O Proposals (Ceph Next) Mark Nelson
2016-04-16  0:07 ` Matt Benjamin
     [not found]   ` <CAFdRU72-CyuFRodb-HoNrBHWZRV7Xj4Ki-yHvxHPKAZeZ213Wg@mail.gmail.com>
2016-04-16  0:34     ` Shinobu Kinjo
2016-04-16  0:52       ` OSD set partitioning Adam C. Emerson
2016-04-18  4:00 ` Ann Arbor Team's Flexible I/O Proposals (Ceph Next) Haomai Wang
