From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff Layton Subject: Re: msgr2 protocol Date: Tue, 13 Sep 2016 10:48:54 -0400 Message-ID: <1473778134.4740.44.camel@redhat.com> References: <20160610190510.GA18999@degu.eng.arb.redhat.com> <20160611230503.GA18268@degu.eng.arb.redhat.com> <1473765504.4740.14.camel@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit Return-path: Received: from mail-qk0-f180.google.com ([209.85.220.180]:32924 "EHLO mail-qk0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752710AbcIMOs7 (ORCPT ); Tue, 13 Sep 2016 10:48:59 -0400 Received: by mail-qk0-f180.google.com with SMTP id w204so177367321qka.0 for ; Tue, 13 Sep 2016 07:48:59 -0700 (PDT) In-Reply-To: Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Sage Weil Cc: Haomai Wang , Marcus Watts , Gregory Farnum , ceph-devel On Tue, 2016-09-13 at 13:31 +0000, Sage Weil wrote: > On Tue, 13 Sep 2016, Jeff Layton wrote: > > On Sun, 2016-09-11 at 17:05 +0000, Sage Weil wrote: > > > On Sat, 10 Sep 2016, Haomai Wang wrote: > > > > About thing is v1/v2 compatible. I rethink the details: > > > >  > > > > 0. we need to define the new banner which must longer than before("ceph v027") > > > > 1. assume msgr v2 banner is "ceph v2 %64llx %64llx\n" > > > > 2. both in simle/async codes, server side must issue banner firstly > > > > 3. if server side supports v2 and client only supports v1, client will > > > > receive 9 bytes and do memcmp, then reject this connection via closing > > > > socket. So server side could retry the older version > > > > 4. if server side only supports v1 and client supports v2, client > > > > according banner to reply corresponding banner > > > >  > > > > This tricky design is based on the implementation fact "accept side > > > > issue the banner firstly" and "new banner is longer than old banner", > > > > and this way doesn't need to involve other dependences like mon port > > > > changes. > > > >  > > > > Does this way has problem? > > >  > > > I was thinking we avoid this problem and any hacky initial handshakes by  > > > speaking v2 on the new port and v1 on the old port.  Then the monmap has  > > > an entity_addrvec_t with both a v1 and v2 address (encoding with just the  > > > v1 address for old clients). Same for the OSDs. > > >  > > > The v1 handshake just isn't extensible (how do you tell a v2 client  > > > connecting that you speak both v1 and v2?). > > >  > >  > > Depending on port assignments for the protocol is pretty icky though. > > There may be valid reasons to use different ports in some environments > > and then that heuristic goes right out the window. > >  > > One thing that is really strange about both the old and new protocols > > is that they have the client and server sending the initial exchange > > concurrently, or have the server send it first.  While it may speed up > > the initial negotiation slightly, it makes it really hard to handle > > fallback to earlier protocol versions (as Haomai pointed out), as the > > client is responsible for handing reconnects. > >  > > Consider the case where we have a client that supports only v1 but a > > server that supports v1 and v2. Client connects and then server sends a > > v2 message. Client doesn't understand it and closes the connection and > > reconnects, only to end up in the same situation on the second attempt. > >  > > There's no way for the server to preserve the state from the initial > > connection attempt and handle the new connection with v1. Would it not > > make more sense to have the client connect and send its initial banner, > > and then let the server decide what sort of banner to send based on > > what the client sent? > > This is why the v2 banner has the features values (%lx with supported and  > required bits).  Clients and servers (connecter and accepters, really,  > since servers talk to each other too) can concurrently announce what they  > support and require and then go from there.  It doesn't help with the v1  > transition, but the addrvec changes (entity_addr_t now has a type  > indicating which protocol is spoken, and multiple addrs can be listed for  > any server) along with a mon port change (which we have to do anyway to  > switch to our IANA assigned port) handle the v1 transition. > Ahh ok, I didn't realize ceph was squatting on a port! Ok, then if you're planning to switch to a new well-known port anyway, then a clean break like this makes more sense. I'll confess though that I don't quite understand the whole point of the entity_addr_t's. What purpose does it serve to exchange network addresses here? Is it simply to inform the peer of other ways that it can be reached? What happens if I pick up my laptop that's acting as a ceph client and wander onto a new network. Does anything break? > Are there other reasons to do the client banner first?   > > I think that's the main one. The only other reason I can think of might be to guard against information disclosure to port scanners. If you require the client to send a banner first, then the server could drop the connection if it doesn't look right without ever sending anything. That said, given that we're going to be using a IANA designated port in most cases, that's not going to be terribly useful. The port scanners would just send a bogus but legit-looking banner to that port. > >  > > > >  > > > >  > > > > > On Sat, Sep 10, 2016 at 11:37 AM, Haomai Wang wrote: > > > > > > > > > > > > > > > > On Sat, Sep 10, 2016 at 5:14 AM, Sage Weil wrote: > > > > >> > > > > >> On Sat, 10 Sep 2016, Haomai Wang wrote: > > > > >> > @sage in current impl, when logic fault like state mismatch, data format > > > > >> > mismatch or anything else, connection will abort session via closing > > > > >> > socket. > > > > >> > And the peer side would do something according to policy too. > > > > >> > In msgr v2 when introducing multi streams in the same connection, we > > > > >> > can't > > > > >> > simply abort socket to indicate something wrong now. I think we need to > > > > >> > introduce TAG_ABORT with error message. > > > > >> > > > > > >> > But the peer side may stuck into a state like reading enough data as > > > > >> > "length" indicate. It may miss the TAG_ABORT notify or other reconnect > > > > >> > tag. > > > > >> > A tricky thing is we use tcp OOB bit to send which exactly trigger > > > > >> > "urgent" > > > > >> > signal when receiving, but it only occur 1 byte in tcp proto which can > > > > >> > be > > > > >> > used here to indicate the stream id(32bit designed now). > > > > >> > > > > > >> > What's more, multi stream mixed within one socket may make trouble to > > > > >> > message receiving when potential tcp packet silent error. So it looks we > > > > >> > can't use the same socket the multi stream to meet our demands. > > > > >> > > > > > >> > Any idea? > > > > >> > > > > >> I think we intorduce a TAG_ABORT to interrupt the stream.  And then we > > > > >> have to assume that the low-level msgr2 implementation that reads and > > > > >> writes frames (which have their own frame_len) is not buggy.  In practice, > > > > >> the aborts tend to happen because we get a message we don't understand > > > > >> (version mismatch, encoding compatibility bug, etc.), and that'll happen > > > > >> at a higher level after frames have been read... so a TAG_ABORT will be > > > > >> sufficient. > > > > > > > > > > > > > > > yes, if frame is ok. It should be ok.... Let's go through this firstly... > > > > > The worse case is the frame length is not expected as data transferred. > > > > > > > > > >> > > > > >> > > > > >> Also, we can have an option to make aborts close the socket.  That'll be > > > > >> fine for now anyway, although later it's probably to disruptive when > > > > >> multiple streams are sharing a socket... > > > > >> > > > > >> sage > > > > >> > > > > >> > > > > >>  > > > > > >> > > > > > >> > > > > > > >> > On Mon, Jun 13, 2016 at 7:59 AM, Sage Weil wrote: > > > > >> >       On Sat, 11 Jun 2016, Marcus Watts wrote: > > > > >> >       > If the client doesn't look at "features" before it sends > > > > >> >       stuff, it > > > > >> >       > will not be able to be very smart about taking advantage of > > > > >> >       some > > > > >> >       > future better method.  In fact, there isn't much advantage > > > > >> >       > to the server sending anything early - it could just as easily > > > > >> >       > wait until after it's seen the clients request. > > > > >> >       > > > > > >> >       > Failing hard & retrying on a failed reconnect is going to be > > > > >> >       slower. > > > > >> >       > On the bright side, at least it shouldn't happen often. > > > > >> > > > > > >> >       Yep.  Well, I think it is the client's (limited choice).  If it > > > > >> >       needs to > > > > >> >       know the server features, it needs to either wait for them, or > > > > >> >       make some > > > > >> >       optimistic choice and be prepared to pay the cost of a mistake. > > > > >> >       We should > > > > >> >       give the client choice, though, if we can. > > > > >> > > > > > >> >       > If you're sending encryption (w/ different auth or keys) from > > > > >> >       several > > > > >> >       > different streams, how are you planning to indicate which bits > > > > >> >       > go with which scheme?, and which bits are you planning to > > > > >> >       encrypt > > > > >> >       > and which not? > > > > >> > > > > > >> >       This is what he stream ids are for, and why the outer portion of > > > > >> >       the frame > > > > >> >       is unencrypted.  See > > > > >> > > > > > >> > > > > > >> > https://github.com/ceph/ceph/pull/9461/files#diff-83789b4be697d82eedbcbe330 > > > > >> >       c44b436R68 > > > > >> > > > > > >> >        +  stream_id (le32) > > > > >> >        +  frame_len (le32) > > > > >> >        +  tag (TAG_* byte) > > > > >> >        +  payload > > > > >> >        +  [payload padding -- only present after stream auth phase] > > > > >> >        +  [signature -- only present after stream auth phase] > > > > >> > > > > > >> >       The tag and payload (and padding) would be encrypted or signed, > > > > >> >       but not > > > > >> >       the stream id and frame_len. > > > > >> > > > > > >> >       > Byte count limits.  Basically, you don't want collisions > > > > >> >       because > > > > >> >       > of duplicated keys or data.  This depends on your crypto > > > > >> >       system, > > > > >> >       > so, for instance, you should not encrypt with one key more > > > > >> >       than > > > > >> >       >       aes, cbc        about 2^68 bytes > > > > >> >       >       aes, ctr        exactly 2^128 bytes > > > > >> >       > more generally, this depends on mode, blocksize, ... > > > > >> >       > This applies across *all* uses of the key - and so you would > > > > >> >       > generally want to use the session key directly as little as > > > > >> >       possible. > > > > >> >       > (in particular, using the session key for ctr directly would > > > > >> >       be very very bad.) > > > > >> >       > > > > > >> >       > If you've got multiple streams going already, you should be > > > > >> >       able > > > > >> >       > to include a fairly simple rekey method with little effort. > > > > >> >       > For instance, as part of the method, you could, > > > > >> >       >       up front as part of the method > > > > >> >       >               send a per-stream key encrypted under the shared > > > > >> >       secret. > > > > >> >       >       prepend to the first data sent in a payload > > > > >> >       >               byte limit, stream key #0 (encrypted under the > > > > >> >       per-stream key) > > > > >> >       >               then encrypt the next N bytes with stream key #0 > > > > >> >       >       when the byte limit is reached, prepend to the > > > > >> >       >               next data sent in a payload > > > > >> >       >               byte limit, stream key #1 (encrypted under the > > > > >> >       per-stream key) > > > > >> >       >               then encrypt the next N bytes with stream key #1 > > > > >> >       >       &etc. > > > > >> > > > > > >> >       Good idea.  If I understand correctly, it means that the > > > > >> >       session_key is > > > > >> >       only used to send the new/next random encryption key, and if we > > > > >> >       make the > > > > >> >       byte limit part of the initial protocol we get the rotation we > > > > >> >       need.  It > > > > >> >       might be simpler to do it as a frame limit instead of byte > > > > >> >       limit, and > > > > >> >       assume max-length frames (2^32 bytes).  We could still be super > > > > >> >       conservative and rotate the encryption key every 2^16 messages > > > > >> >       or > > > > >> >       something...?  And rotating the key on frame boundaries should > > > > >> >       be much > > > > >> >       simpler to implement. > > > > >> > > > > > >> >       Anyway, that part can be defined a bit later, I think. > > > > >> > > > > > >> >       Thanks! > > > > >> >       sage > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > > > > > > > > > > >  > > > >  > > > -- > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html > > --  > > > Jeff Layton > > --  > > > Jeff Layton > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at  http://vger.kernel.org/majordomo-info.html > >  > >  -- Jeff Layton