From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: Ann Arbor Team's Flexible I/O Proposals (Ceph Next)
Date: Fri, 15 Apr 2016 19:03:05 -0500
Message-ID: <57118139.1030004@redhat.com>
References: <20160415210536.GA16458@ultraspiritum.redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:36557 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751968AbcDPADI (ORCPT ); Fri, 15 Apr 2016 20:03:08 -0400
Received: from int-mx09.intmail.prod.int.phx2.redhat.com (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mx1.redhat.com (Postfix) with ESMTPS id A65D93B70E
	for ; Sat, 16 Apr 2016 00:03:07 +0000 (UTC)
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Gregory Farnum , The Sacred Order of the Squid Cybernetic

On 04/15/2016 05:09 PM, Gregory Farnum wrote:
> On Fri, Apr 15, 2016 at 2:05 PM, Adam C. Emerson wrote:
>> Ceph Developers,
>>
>> We've put together a few of the main ideas from our previous work in a
>> brief form that we hope people will be able to digest, consider, and
>> debate. We'd also like to discuss them with you at Ceph Next this
>> Tuesday.
>>
>> Thank you.
>>
>>
>> ---8<---
>>
>>
>> We have been looking at improvements to Ceph, particularly RADOS,
>> while focusing on flexibility (allowing users to do more things)
>> and performance. We have come up with a few proposals with these two
>> things in mind. Sessions and read-write transactions aim to allow
>> clients to batch up multiple operations in a way that is safe and
>> correct, while allowing clients to gain the advantages of atomic
>> read-write operations without having to lock.
>> Sessions also provide
>> a foundation for flow-control which ultimately improves performance
>> by preventing an OSD from being ground into uselessness under a
>> storm of impossible requests. The CLS proposal is a logical follow-on
>> from the read-write proposal, as we attempt to address some problems
>> of correctness that exist now and consider how to integrate the
>> facility into an asynchronous world.
>>
>> Flexible Placement, as you would expect from the name, is about
>> allowing users more control, as are Flexible Semantics. They both
>> have profound performance implications, as tuning placement to better
>> match a workload can increase throughput, and relaxed consistency can
>> decrease latency. The proposed Interfaces are meant to support both as
>> well as work currently being done to allow an asynchronous OSD and to
>> hide details like locking and thread pools so that backends can be
>> written with different forms of concurrency and load-balancing
>> across processors.
>>
>> Finally, Map Partitioning is not directly related to code paths within
>> the OSD itself, but does affect everything that can be done with Ceph.
>> People are beginning to run into limits on how large a Ceph cluster can
>> grow and how many ways they can be partitioned, and both these problems
>> fundamentally derive from the way the OSD map is handled by the monitors.
>>
>> There are also some notes at the end. They are not critical, but if you
>> find yourself asking "What were they thinking?" the notes might help.
>>
>> # Sessions and Read-Write #
>>
>> From `ReplicatedPG.cc`.
>>
>> ```c++
>> // Write operations aren't allowed to return a data payload because
>> // we can't do so reliably. If the client has to resend the request
>> // and it has already been applied, we will return 0 with no
>> // payload. Non-deterministic behavior is no good.
>> // However, it is possible to construct an operation that does a
>> // read, does a guard check (e.g., CMPXATTR), and then a write. Then
>> // we either succeed with the write, or return a CMPXATTR and the
>> // read value.
>> …
>> if (ctx->op_t->empty() || result < 0) {
>>   …
>>   if (ctx->pending_async_reads.empty()) {
>>     complete_read_ctx(result, ctx);
>>   } else {
>>     in_progress_async_reads.push_back(make_pair(op, ctx));
>>     ctx->start_async_reads(this);
>>   }
>>   return;
>> }
>> …
>> // issue replica writes
>> ceph_tid_t rep_tid = osd->get_tid();
>>
>> RepGather *repop = new_repop(ctx, obc, rep_tid);
>>
>> issue_repop(repop, ctx);
>> eval_repop(repop);
>> ```
>>
>> As you can see, if we have any writes (all mutations end up in the
>> `op_t` transaction), we just flat out don't do the requested read
>> operations. If we don't have any writes, we perform the read
>> operations and return. This is justified in the comment above because
>> of the non-deterministic behavior of resent read-write operations.
>>
>> This is not an unsolved problem and we can bootstrap a solution on our
>> existing `Session` infrastructure.
>>
>> ## An upgraded session ##
>>
>> Behold, `OSDSession`:
>> ```c++
>> struct Session : public RefCountedObject {
>>   EntityName entity_name;
>>   OSDCap caps;
>>   int64_t auid;
>>   ConnectionRef con;
>>   WatchConState wstate;
>>   …
>> };
>> ```
>>
>> This structure exists once for every connection to the OSD. Where they
>> are created depends on who is doing the creation.
>> In the case of
>> clients (what we're interested in) it occurs in `ms_handle_authorizer`:
>> ```c++
>> …
>> isvalid = authorize_handler->verify_authorizer(cct, monc->rotating_secrets,
>>                                                authorizer_data, authorizer_reply,
>>                                                name, global_id, caps_info,
>>                                                session_key, &auid);
>>
>> if (isvalid) {
>>   Session *s = static_cast<Session *>(con->get_priv());
>>   if (!s) {
>>     s = new Session(cct);
>>     con->set_priv(s->get());
>>     s->con = con;
>>     dout(10) << " new session " << s << " con=" << s->con
>>              << " addr=" << s->con->get_peer_addr() << dendl;
>>   }
>>
>>   s->entity_name = name;
>>   if (caps_info.allow_all)
>>     s->caps.set_allow_all();
>>   s->auid = auid;
>>   …
>> }
>> ```
>>
>> In order to solve this problem, we propose a new data structure,
>> modelled on NFSv4.1:
>> ```c++
>> struct OpSlot {
>>   uint64_t seq;
>>   int r;
>>   MOSDOpReplyRef cached; // Nullable
>>   bool completed;
>> };
>> ```
>>
>> We do not want to give the OSD an unbounded obligation to hang on to
>> old message replies: that way lies madness. So, the additions to
>> `Session` we might make are:
>>
>> ```c++
>> struct Session : public RefCountedObject {
>>   …
>>   uint32_t maxslots; // The maximum number of operations this client
>>                      // may have in flight at once
>>   std::vector<OpSlot> slots; // The vector of in-progress operations
>>   ceph::timespan slots_expire; // How long we wait to hear from a
>>                                // client before the OSD is free to
>>                                // drop session resources
>>   ceph::coarse_mono_time last_contact; // When (by our measure) we
>>                                        // last received an operation
>>                                        // from the client.
>> };
>> ```
>>
>> ## Message Additions ##
>>
>> The OSD needs to communicate this information to the client. The most
>> useful way to do this is with an addition to `MOSDOpReply`.
>>
>> ```c++
>> class MOSDOpReply : public Message {
>>   …
>>   uint32_t this_slot;
>>   uint64_t this_seq;
>>   uint32_t max_slot;
>>   ceph::timespan timeout;
>>   …
>> };
>> ```
>>
>> This overlaps with the function of the transaction ID, since the
>> slot/sequence/OSD triple uniquely identifies an operation. Unlike the
>> transaction ID, this provides consistent semantics and a measure of
>> flow control.
>>
>> To match our reply, the `MOSDOp` would need to be amended.
>> ```c++
>> class MOSDOp : public Message {
>>   …
>>   uint32_t this_slot;
>>   uint64_t this_seq;
>>   bool please_cache;
>>   …
>> };
>> ```
>>
>> ## Operations ##
>>
>> ### Connecting ###
>>
>> A client, upon connecting to an OSD for the first time, should send a
>> `this_slot` of 0 and a `this_seq` of 0. If it reconnects to an OSD it
>> should use the `this_slot` and `this_seq` values from before it lost
>> its connection. If an OSD has state for a client and receives a
>> `(slot,seq) = (0,0)` then it should feel free to free any saved state
>> and start anew.
>>
>> ### OSD Feedback ###
>>
>> In every `MOSDOpReply` the OSD should set `this_slot` and `this_seq` to
>> the values from the `MOSDOp` to which it is replying.
>>
>> More usefully, the OSD can inform the client how many operations it is
>> allowed to send concurrently with `max_slot`. The client must **not**
>> send a slot value higher than `max_slot`. (The OSD should error if it
>> does.)
>>
>> The OSD may increase the number of operations allowed in-flight
>> if it has capacity by increasing `max_slot`. If it finds itself
>> lacking capacity, it may decrease `max_slot`. If it does, the client
>> should respect the new bound. (The OSD should feel free to free the
>> rescinded slots as soon as the client sends another `MOSDOp` with a
>> slot value equal to one on which the new `max_slot` has been sent.)
>>
>> If the client sends a `this_seq` lower than the one held for a slot by
>> the OSD, the OSD should error. If it is more than one greater than the
>> current `this_seq`, the OSD should error.
>>
>> ### Caching ###
>>
>> The client is in an excellent position to know whether it **requires**
>> the output of a previous operation of mixed reads and writes on
>> resend, or whether it merely needs the status on resend. Thus, we let
>> the client set `please_cache` to request that the OSD store a
>> reference to the sent message in the appropriate `OpSlot`.
>>
>> The OSD is in an excellent position to know how loaded it is. It can
>> calculate a bound on how large a given reply will be before executing
>> it. Thus, the OSD can send an error if the client has requested it
>> cache something larger than it feels comfortable caching.
>>
>> Assuming no errors, the behavior, for any slot, is this: If the client
>> sends an `MOSDOp` with a `this_seq` one greater than the current value
>> of `OpSlot::seq`, that represents a new operation. Increment
>> `OpSlot::seq`, clear `OpSlot::completed` and begin the operation. When
>> the operation finishes, set `OpSlot::completed`. If `please_cache` has been
>> set, store the `MOSDOpReply` in `OpSlot::cached`. Otherwise simply store the
>> result code in `OpSlot::r`.
>>
>> If the client sends an `MOSDOp` with a `this_seq` equal to
>> `OpSlot::seq` and `OpSlot::completed` is false, drop the request. (We
>> will reply when it completes.) If it has completed, send the stored
>> `MOSDOpReply` from `OpSlot::cached` if there is one, otherwise send a
>> reply with just `OpSlot::r`.
>>
>> ### Reconnection ###
>>
>> Currently the `Session` is destroyed on reset and a new one is created
>> on authorization. In our proposed system the `Session` will not be
>> destroyed on reset, it will be moved to a structure where it can be
>> looked up and destroyed after `timeout` since the last message
>> received.
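
For illustration, the per-slot rules quoted above under "OSD Feedback" and "Caching" can be sketched roughly as below. This is only a sketch under assumptions, not code from the proposal or from Ceph: `Reply` is a simplified stand-in for `MOSDOpReply`, and `SlotTable`, `Action`, `handle_op` and `complete` are hypothetical names. It also assumes valid slot indices run from 0 to `maxslots - 1`, and it leaves out the `(0,0)` connection handshake, timeouts and any persistence.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-in for MOSDOpReply (hypothetical, for illustration only).
struct Reply {
  int r;
  std::string payload;
};

// Mirrors the proposed OpSlot, with a shared_ptr standing in for MOSDOpReplyRef.
struct OpSlot {
  uint64_t seq = 0;
  int r = 0;
  std::shared_ptr<Reply> cached;  // null unless please_cache was set
  bool completed = true;          // nothing in flight initially
};

// What the OSD would do with an incoming (slot, seq) pair.
enum class Action { Start, Drop, Resend, Error };

struct SlotTable {
  std::vector<OpSlot> slots;

  explicit SlotTable(uint32_t maxslots) : slots(maxslots) {}

  Action handle_op(uint32_t slot, uint64_t seq) {
    if (slot >= slots.size())
      return Action::Error;  // slot outside the advertised bound
    OpSlot& s = slots[slot];
    if (seq == s.seq + 1) {
      // New operation: advance the sequence and forget the old result.
      s.seq = seq;
      s.completed = false;
      s.cached.reset();
      s.r = 0;
      return Action::Start;
    }
    if (seq == s.seq)  // resend: replay if finished, otherwise drop it
      return s.completed ? Action::Resend : Action::Drop;
    return Action::Error;  // lower, or more than one greater
  }

  // Record the outcome of the operation running in `slot`.
  void complete(uint32_t slot, const Reply& reply, bool please_cache) {
    assert(slot < slots.size());
    OpSlot& s = slots[slot];
    s.completed = true;
    s.r = reply.r;  // the result code is always kept
    if (please_cache)
      s.cached = std::make_shared<Reply>(reply);  // full reply only on request
  }
};
```

On `Action::Resend` a caller would send `cached` if it is non-null and otherwise a reply carrying only `r`. Everything here lives in memory, so it does not address results surviving an OSD crash or a change of acting set.
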
>>
>> On connection, the OSD should first look up a `Session` keyed
>> on the entity name and create one if that fails.
>
> So the most common time we really get replay operations is when one of
> the OSDs crashes or a PG's acting set changes for some other reason.
> Which means these "cached" operation results need to be persisted to
> disk and then cleaned up, a la the pglog.
> I don't see anything in these data structures that explains how we do
> that efficiently, which is the biggest problem and the reason we don't
> already do reply caching. Am I missing something?
>
> And do you think maybe you could split this up into a thread for each
> topic? I'm having trouble digesting it as such a wall of text. :)

Seconded! :D

> -Greg
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>