netdev.vger.kernel.org archive mirror
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
To: Tom Herbert <tom@herbertland.com>
Cc: davem@davemloft.net, netdev@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)
Date: Mon, 21 Sep 2015 08:24:22 -0400
Message-ID: <20150921122422.GA31092@oracle.com>
In-Reply-To: <1442788161-2626305-1-git-send-email-tom@herbertland.com>

On (09/20/15 15:29), Tom Herbert wrote:
> 
> Kernel Connection Multiplexor (KCM) is a facility that provides a
> message based interface over TCP for generic application protocols.
> The motivation for this is based on the observation that although
> TCP is byte stream transport protocol with no concept of message
> boundaries, a common use case is to implement a framed application
> layer protocol running over TCP. To date, most TCP stacks offer
> byte stream API for applications, which places the burden of message
> delineation, message I/O operation atomicity, and load balancing
> in the application. With KCM an application can efficiently send
> and receive application protocol messages over TCP using a
> datagram interface.

A lot of this design is very similar to the PF_RDS/RDS-TCP
design. There too, we have a PF_RDS dgram socket (that already 
supports SEQPACKET semantics today) that can be tunneled over TCP.

The biggest design difference that I see in your proposal is 
that you are using BPF so presumably the demux has more flexibility
than RDS, which does the demux based on RDS port numbers?

Would it make sense to build your solution on top of RDS,
rather than re-invent solutions for many of the challenges
that one encounters when building a dgram-over-stream hybrid
socket (see "lessons learned" list below)?

Some things that were not clear to me from the patch-set:

The doc states that we re-assemble packets to the "stated length" -
but how will the receiver know the "stated length"?
(fwiw, RDS figures that out from the header len in RDS,
and elsewhere I think you allude to some similar encaps
header - is that a correct understanding?)
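
To make the question concrete, here is a minimal sketch of receive-side
reassembly driven by a length carried in an encapsulation header. The
4-byte big-endian length prefix and the names HDR, frame() and
Reassembler are assumptions made up for illustration; they are not the
RDS header layout and not anything taken from the KCM patches.

```python
import struct

# Hypothetical encapsulation header: a 4-byte big-endian payload length.
HDR = struct.Struct("!I")


def frame(payload):
    """Prefix a payload with its length so the receiver can find the boundary."""
    return HDR.pack(len(payload)) + payload


class Reassembler:
    """Collects arbitrary stream chunks and returns complete messages."""

    def __init__(self):
        self.buf = bytearray()

    def feed(self, chunk):
        """Add bytes read from the stream; return the messages now complete."""
        self.buf += chunk
        msgs = []
        while len(self.buf) >= HDR.size:
            (length,) = HDR.unpack_from(self.buf)
            end = HDR.size + length
            if len(self.buf) < end:
                break  # body has not fully arrived yet
            msgs.append(bytes(self.buf[HDR.size:end]))
            del self.buf[:end]
        return msgs
```

The receiver never depends on how TCP happened to segment the data: it
only hands up a message once the header and the full stated length have
been buffered.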

Not clear from the diagram: is there one TCP socket per kcm-socket?
What is the relation (one-one, many-one, etc.) between a kcm-socket and
a psock? How does the ksock-psock-tcp-sock association get set up?

The notes say one can "accept()" on a kcm socket, but "accept()"
is itself a connection-oriented concept: one does not accept() on
a dgram socket. So what exactly does this mean, and why not just
use the well-defined TCP socket semantics at that point (with something
like XDR for message boundary marking)?
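
For reference, the kind of boundary marking meant here can be sketched
as below. It follows the record marking scheme used by ONC RPC over
TCP, where each fragment carries a 4-byte big-endian header whose top
bit flags the last fragment of a record and whose low 31 bits give the
fragment length. The function names rm_encode() and rm_decode() are
hypothetical, and this is only a userspace illustration, not a proposal
for kernel code.

```python
import struct

LAST_FRAG = 0x80000000  # top bit set: final fragment of the record
LEN_MASK = 0x7FFFFFFF   # low 31 bits: fragment length in bytes


def rm_encode(record, max_frag=LEN_MASK):
    """Encode one record as one or more length-marked fragments."""
    if not 0 < max_frag <= LEN_MASK:
        raise ValueError("max_frag out of range")
    out = bytearray()
    pos = 0
    while True:
        frag = record[pos:pos + max_frag]
        pos += len(frag)
        last = pos >= len(record)
        out += struct.pack("!I", len(frag) | (LAST_FRAG if last else 0))
        out += frag
        if last:
            return bytes(out)


def rm_decode(stream):
    """Return (complete records, unconsumed tail) for the bytes seen so far."""
    records, cur = [], bytearray()
    pos = consumed = 0
    while pos + 4 <= len(stream):
        (hdr,) = struct.unpack_from("!I", stream, pos)
        length = hdr & LEN_MASK
        if pos + 4 + length > len(stream):
            break  # fragment body incomplete
        cur += stream[pos + 4:pos + 4 + length]
        pos += 4 + length
        if hdr & LAST_FRAG:
            records.append(bytes(cur))
            cur = bytearray()
            consumed = pos
    return records, stream[consumed:]
```

With this on a plain TCP socket the application keeps ordinary
listen()/accept() semantics and still gets message boundaries.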

In the "fwiw" bucket of lessons learned from RDS..  please ignore if
you were already aware of these- 

In the case of RDS, since multiple rds/dgram sockets share a single TCP
socket, some issues that have to be dealt with are

- congestion/starvation: we don't want tcp to start advertising
  zero-window because one dgram socket pair has flooded the pipe
  and the peer is not reading. So the RDS protocol has port-congestion
  RDS control plane messages that track congestion at the RDS port.

- imposes some constraints on the TCP send side- if sock1 and sock2
  are sharing a tcp socket, and both are sending dgrams over the
  stream, dgrams from sock1 may get interleaved with those from sock2
  (see comments above rds_send_xmit() for a note on how rds deals with
  this). There are ways to fan this out over multiple tcp sockets (and
  I'm working on those, to improve the scaling), but just a note that
  there is some complexity to be dealt with here. Not sure if this was
  considered in the "KCM sockets" section in patch2..

- in general the "dgram-over-stream" hybrid has some peculiar issues. E.g.,
  dgram APIs like BINDTODEVICE and IP_PKTINFO cannot be applied
  to the underlying stream. In the typical use case for RDS (database 
  clusters) there's a reasonable workaround for this using network
  namespaces to define bundles of outgoing interfaces, but that solution
  may not always be workable for other use-cases. Thus it might actually
  be more obvious to simply use tcp sockets (and use something like XDR
  for message boundary markers on the stream).
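
As a rough illustration of the send-side constraint in the second
bullet: when several dgram sockets share one stream, each framed
message has to reach the stream as a unit, or the receiver can no
longer demux by port. The sketch below is not how rds_send_xmit()
works; the header layout (16-bit port, 32-bit length) and the names
SharedStream, send_dgram() and demux() are assumptions for
illustration only.

```python
import struct
import threading

# Hypothetical header: 16-bit destination port, 32-bit payload length.
HDR = struct.Struct("!HI")


class SharedStream:
    """Stands in for one TCP socket shared by several dgram sockets."""

    def __init__(self):
        self.wire = bytearray()
        self.lock = threading.Lock()

    def send_dgram(self, port, payload, chunk=2):
        data = HDR.pack(port, len(payload)) + payload
        # Hold the lock across the whole message, not per write, so that
        # partial writes from two senders cannot interleave on the stream.
        with self.lock:
            for i in range(0, len(data), chunk):
                self.wire += data[i:i + chunk]


def demux(wire):
    """Split the shared stream back into per-port message queues."""
    queues = {}
    pos = 0
    while pos < len(wire):
        port, length = HDR.unpack_from(wire, pos)
        pos += HDR.size
        queues.setdefault(port, []).append(bytes(wire[pos:pos + length]))
        pos += length
    return queues
```

Serializing whole messages like this keeps the stream parseable, but it
is also where the scaling cost of sharing a single tcp socket comes
from, which is the motivation for fanning out over several.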

--Sowmini
 
