* Ceph Messaging on Accelio (libxio) RDMA
@ 2013-12-11 22:32 Matt W. Benjamin
  2013-12-11 22:58 ` Gregory Farnum
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Matt W. Benjamin @ 2013-12-11 22:32 UTC (permalink / raw)
  To: ceph-devel; +Cc: Sage Weil, Yaron Haviv, Eyal Salomon

Hi Ceph devs,

For the last several weeks, we've been working with engineers at
Mellanox on a prototype Ceph messaging implementation that runs on
the Accelio RDMA messaging service (libxio).

Accelio is a rather new effort to build a high-performance, high-throughput
message passing framework atop openfabrics ibverbs and rdmacm primitives.

It's early days, but the implementation has started to take shape, and
gives a feel for what the Accelio architecture looks like when using the
request-response model, as well as for our prototype mapping of the
xio framework concepts to the Ceph ones.

The current classes and responsibilities break down somewhat as follows.
The key classes in the TCP messaging implementation are:

Messenger (abstract, represents a set of bidirectional communication endpoints)
SimpleMessenger (concrete TCP messenger)

Message (abstract, models a message between endpoints, all Ceph protocol messages
derive from Message, obviously)

Connection (concrete, though it -feels- abstract;  Connection models a communication
endpoint identifiable by address, but has -some- coupling with the internals of
SimpleMessenger, in particular, with its Pipe, below).

Pipe (concrete, an active (threaded) object that encapsulates various operations on
one side (send or recv) of a TCP connection.  The Pipe is really where a -lot- of
the heavy lifting of SimpleMessenger is localized, and not just in the obvious
ways--e.g., Pipe drives the dispatch queue in SimpleMessenger, so a lot of its
visible semantics are built in cooperation with Pipe).

Dispatcher (abstract, models the application processing messages and sending replies--ie, the upper edge of Messenger).
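
As a rough illustration of how these roles fit together, here is a minimal
sketch.  It is not the actual Ceph declarations: the classes are heavily
simplified, and LoopbackMessenger, PingSketch, CountingDispatcher and inject()
are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for the abstract Message base class.
struct Message {
  virtual ~Message() = default;
  virtual int get_type() const = 0;
};

// Upper edge of the Messenger: the application receives messages here.
struct Dispatcher {
  virtual ~Dispatcher() = default;
  // Returns true if this dispatcher consumed the message.
  virtual bool ms_dispatch(Message* m) = 0;
};

// Abstract Messenger: aggregates dispatchers and hands them incoming messages.
class Messenger {
public:
  virtual ~Messenger() = default;
  void add_dispatcher(Dispatcher* d) { dispatchers.push_back(d); }

protected:
  // Offer an incoming message to each dispatcher in registration order.
  bool deliver(Message* m) {
    for (Dispatcher* d : dispatchers)
      if (d->ms_dispatch(m)) return true;
    return false;
  }

private:
  std::vector<Dispatcher*> dispatchers;
};

// Hypothetical concrete messenger: "receives" whatever is injected locally,
// standing in for a real transport such as TCP or Accelio.
struct LoopbackMessenger : Messenger {
  bool inject(Message* m) { return deliver(m); }
};

// Hypothetical concrete message with an arbitrary type code of 1.
struct PingSketch : Message {
  int get_type() const override { return 1; }
};

// Hypothetical dispatcher that only handles type 1 and counts what it sees.
struct CountingDispatcher : Dispatcher {
  int seen = 0;
  bool ms_dispatch(Message* m) override {
    if (m->get_type() != 1) return false;
    ++seen;
    return true;
  }
};
```

The point of the sketch is only that a new transport slots in below
Messenger, while Dispatcher implementations above it stay unchanged.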

The approach I took in incorporating Accelio was to build on the key abstractions
of Messenger, Connection, Dispatcher, and Message, and build a corresponding
family of concrete classes:

XioMessenger (concrete, implements Messenger, encapsulates xio endpoints, aggregates
dispatchers as normal).

XioConnection (concrete, implements Connection)

XioPortal (concrete, a new class that represents worker thread contexts for all XioConnections in a given XioMessenger)

XioMsg (concrete, a "transfer" class linking a sequence of low-level Accelio datagrams with a Message being sent)

XioReplyHook (concrete, derived from Ceph::Context [indirectly via Message::ReplyHook], links a sequence of low-level Accelio datagrams for a Message that has been received-- that is, part of a new "reply" abstraction exposed to Message and Messenger).

As noted above, there is some leakage of SimpleMessenger primitives into classes that are intended to be abstract, and some refactoring was needed to fit XioMessenger into the framework.  The main changes I prototyped are as follows:

All traces of Pipe are removed from Connection, which is made abstract.  A new
PipeConnection is introduced, which knows about Pipes.  SimpleMessenger now uses
instances of PipeConnection as its concrete connection type.
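
The shape of that refactoring can be sketched roughly as follows.  This is an
illustration under assumptions, not the code in the branch: Pipe is reduced to
a single flag, and attach(), on_session_established() and the is_connected()
method on these classes are made up for the example.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Stand-in for SimpleMessenger's per-socket worker; the real Pipe is far richer.
struct Pipe {
  bool stopped = false;
};

// Transport-neutral connection: identifies the peer, knows nothing about Pipe.
class Connection {
public:
  explicit Connection(std::string peer) : peer_addr(std::move(peer)) {}
  virtual ~Connection() = default;
  const std::string& get_peer_addr() const { return peer_addr; }
  virtual bool is_connected() const = 0;

private:
  std::string peer_addr;
};

// Concrete connection for the TCP messenger; the only class that sees Pipe.
class PipeConnection : public Connection {
public:
  using Connection::Connection;
  void attach(std::shared_ptr<Pipe> p) { pipe = std::move(p); }
  bool is_connected() const override { return pipe && !pipe->stopped; }

private:
  std::shared_ptr<Pipe> pipe;
};

// Concrete connection for the Accelio messenger; the session state here is
// a placeholder for whatever xio session/connection handles it would hold.
class XioConnection : public Connection {
public:
  using Connection::Connection;
  void on_session_established() { session_up = true; }
  bool is_connected() const override { return session_up; }

private:
  bool session_up = false;
};
```

Callers that only hold a Connection reference work with either transport.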

The most interesting changes I introduced are driven by the need to support
Accelio's request/response model, which exists mainly to support RDMA memory
registration primitives, and needs a concrete realization in the Messenger
framework.

To accommodate it, I've introduced two concepts.  First, callers replying to a Message use a new Messenger::send_reply(Message *msg, Message *reply) method.  In SimpleMessenger, this just maps to a call to send_message(Message *, Connection*), but in XioMessenger, the reply is delivered through a new Message::reply_hook completion functor that XioConnection sets when a message is being dispatched.  This is a general mechanism: new Messenger implementations can derive from Message::ReplyHook to define their own reply behavior, as needed.
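
A minimal sketch of the send_reply()/reply_hook split described above.  Only
the names send_reply, send_message, reply_hook and ReplyHook come from the
description; the class bodies, the payload string, the "Sketch" class names
and the vectors used as fake transports are assumptions for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

struct Message;

// Completion functor that a transport can attach to a received Message.
struct ReplyHook {
  virtual ~ReplyHook() = default;
  virtual void reply(Message* r) = 0;
};

// Fake connection: records payloads "sent" over it.
struct Connection {
  std::vector<std::string> sent;
};

struct Message {
  std::string payload;
  Connection* con = nullptr;              // connection the message arrived on
  std::unique_ptr<ReplyHook> reply_hook;  // set by the transport at dispatch time
};

struct Messenger {
  virtual ~Messenger() = default;
  virtual void send_message(Message* m, Connection* con) {
    con->sent.push_back(m->payload);
  }
  virtual void send_reply(Message* msg, Message* reply) = 0;
};

// TCP-style behavior: a reply is just another message on the same connection.
struct SimpleMessengerSketch : Messenger {
  void send_reply(Message* msg, Message* reply) override {
    send_message(reply, msg->con);
  }
};

// Hypothetical hook: pairs the reply with the request it answers, standing in
// for handing an xio response back to the transport.
struct XioReplyHookSketch : ReplyHook {
  explicit XioReplyHookSketch(std::vector<std::string>& out) : responses(out) {}
  void reply(Message* r) override { responses.push_back(r->payload); }
  std::vector<std::string>& responses;
};

// Request/response-style behavior: the reply travels through the hook.
struct XioMessengerSketch : Messenger {
  void send_reply(Message* msg, Message* reply) override {
    assert(msg->reply_hook != nullptr);  // the transport must have set it
    msg->reply_hook->reply(reply);
  }
};
```

The caller's code is identical in both cases: it calls send_reply(request,
reply) and the messenger decides how the reply is carried.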

A lot of low level details of the mapping from Message to Accelio messaging are
currently in flux, but the basic idea is to re-use the current encode/decode
primitives as far as possible, while eliding the acks, sequence # and tids, and
timestamp behaviors of Pipe, or rather, replacing them with mappings to Accelio
primitives.  I have some wrapper classes that help with this.  For the moment,
the existing Ceph message headers and footers are still there, but are now
encoded/decoded, rather than hand-marshalled.  This means that checksumming
is probably mostly intact.  Message signatures are not implemented.

What works.  The current prototype isn't integrated with the main server daemons
(e.g., OSD) but experimental work on that is in progress.  I've created a pair of
simple standalone client/server applications simple_server/simple_client and
a matching xio_server/xio_client, that provide a minimal message dispatch loop with
a new SimpleDispatcher class and some other helpers, as a way to work with both
messengers side-by-side.  These are currently very primitive, but will probably
do more things soon.  The current prototype sends messages over Accelio, but has some
issues with replies, which should be fixed shortly.  It leaks lots of memory, etc.

We've pushed a work-in-progress branch "xio-messenger" to our external github
repository, for community review.  Find it here:

https://github.com/linuxbox2/linuxbox-ceph

Thanks!

Matt

-- 
Matt Benjamin
CohortFS, LLC.
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://cohortfs.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Ceph Messaging on Accelio (libxio) RDMA
  2013-12-11 22:32 Ceph Messaging on Accelio (libxio) RDMA Matt W. Benjamin
@ 2013-12-11 22:58 ` Gregory Farnum
  2013-12-12  1:13   ` Matt W. Benjamin
  2013-12-12  0:59 ` Sage Weil
  2013-12-12 10:19 ` Kasper Dieter
  2 siblings, 1 reply; 17+ messages in thread
From: Gregory Farnum @ 2013-12-11 22:58 UTC (permalink / raw)
  To: Matt W. Benjamin; +Cc: ceph-devel, Sage Weil, Yaron Haviv, Eyal Salomon

On Wed, Dec 11, 2013 at 2:32 PM, Matt W. Benjamin <matt@cohortfs.com> wrote:
> Hi Ceph devs,
>
> For the last several weeks, we've been working with engineers at
> Mellanox on a prototype Ceph messaging implementation that runs on
> the Accelio RDMA messaging service (libxio).

Very cool! An RDMA Messenger has been a cool-sounding project for
which we haven't been able to get time for several years; I'm glad
somebody is getting the chance to explore it seriously.

> Accelio is a rather new effort to build a high-performance, high-throughput
> message passing framework atop openfabrics ibverbs and rdmacm primitives.
>
> It's early days, but the implementation has started to take shape, and
> gives a feel for what the Accelio architecture looks like when using the
> request-response model, as well as for our prototype mapping of the
> xio framework concepts to the Ceph ones.
>
> The current classes and responsibility breakdown somewhat as follows.
> The key classes in the TCP messaging implementation are:
>
> Messenger (abstract, represents a set of bidirectional communication endpoints)
> SimpleMessenger (concrete TCP messenger)
>
> Message (abstract, models a message between endpoints, all Ceph protocol messages
> derive from Message, obviously)
>
> Connection (concrete, though it -feels- abstract;  Connection models a communication
> endpoint identifiable by address, but has -some- coupling with the internals of
> SimpleMessenger, in particular, with its Pipe, below).
>
> Pipe (concrete, an active (threaded) object that encapsulates various operations on
> one side (send or recv) of a TCP connection.  The Pipe is really where a -lot- of
> the heavy lifting of SimpleMessenger is localized, and not just in the obvious
> ways--eg, Pipe drives the dispatch queue in SimpleMessenger, so a lot of it's
> visible semantics are built in cooperation with Pipe).
>
> Dispatcher (abstract, models the application processing messages and sending replies--ie, the upper edge of Messenger).

Good summary. You've left me feeling a little embarrassed about the
Connection class with that description. ;)

> The approach I took in incorporating Accelio was to build on the key abstractions
> of Messenger, Connection, and Dispatcher, and Message, and build a corresponding
> family of concrete classes:
>
> XioMessenger (concrete, implements Messenger, encapsulates xio endpoints, aggregates
> dispatchers as normal).
>
> XioConnection (concrete, implements Connection)
>
> XioPortal (concrete, a new class that represents worker thread contexts for all XioConnections in a given XioMessenger)
>
> XioMsg (concrete, a "transfer" class linking a sequence of low-level Accelio datagrams with a Message being sent)
>
> XioReplyHook (concrete, derived from Ceph::Context [indirectly via Message::ReplyHook], links a sequence of low-level Accelio datagrams for a Message that has been received-- that is, part of a new "reply" abstraction exposed to Message and Messenger).
>
> As noted above, there is some leakage of SimpleMessenger primitives into classes that are intended to be abstract, and some refactoring was needed to fit XioMessenger into the framework.  The main changes I prototyped are as follows:
>
> All traces of Pipe are removed from Connection, which is made abstract.  A new
> PipeConnection is introduced, that knows about Pipes.  SimpleMessenger now uses
> instances of PipeConnection as its concrete connection type.

This all makes sense.

> The most interesting changes I introduced are driven by the need to support
> Accelio's request/response model, which exists mainly to support RDMA memory
> registration primitives, and needs a concrete realization in the Messenger
> framework.
>
> To accomodate it, I've introduced two concepts.  First, callers replying to a Message use a new Messenger::send_reply(Message *msg, Message *reply) method.  In SimpleMessenger, this just maps to a call to send_message(Message *, Connection*), but in XioMessenger, the reply is delivered through a new Message::reply_hook completion functor that XioConnection sets when a message is being dispatched.  This is a general mechanism, new Messenger implementations can derive from Message::ReplyHook to define their own reply behavior, as needed.

Can you talk more about the request/response model in the
communication layer and why you're explicitly specifying what messages
are replies to others? I'm not sure what makes that useful, or how a
model where replies are explicit deals with stuff like
1) the two "ack/commit" responses to write requests, or
2) some of the requests in which there is not an explicit response
message (especially OSD->monitor stuff like failure reports), or
3) where a request does not get a direct response message, but
triggers a special indirect response of some kind (like the monitor
not acking a change request explicitly, but making sure to send new
maps to the person who requested the map change).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: Ceph Messaging on Accelio (libxio) RDMA
  2013-12-11 22:32 Ceph Messaging on Accelio (libxio) RDMA Matt W. Benjamin
  2013-12-11 22:58 ` Gregory Farnum
@ 2013-12-12  0:59 ` Sage Weil
  2013-12-12  1:33   ` Matt W. Benjamin
  2013-12-12  2:14   ` Mark Nelson
  2013-12-12 10:19 ` Kasper Dieter
  2 siblings, 2 replies; 17+ messages in thread
From: Sage Weil @ 2013-12-12  0:59 UTC (permalink / raw)
  To: Matt W. Benjamin; +Cc: ceph-devel, Yaron Haviv, Eyal Salomon

Hi Matt,

Thanks for posting this!  Some comments and questions below.

On Wed, 11 Dec 2013, Matt W. Benjamin wrote:
> Hi Ceph devs,
> 
> For the last several weeks, we've been working with engineers at
> Mellanox on a prototype Ceph messaging implementation that runs on
> the Accelio RDMA messaging service (libxio).
> 
> Accelio is a rather new effort to build a high-performance, high-throughput
> message passing framework atop openfabrics ibverbs and rdmacm primitives.

I was originally thinking that xio was going to be more mellanox-specific, 
but it looks like it runs over multiple transports (even tcp!).  (I'm sure 
I've been told this before but it apparently didn't sink in.)  Is there 
also a mellanox-specific backend (that is not ibverbs) that takes any 
special advantage of mellanox hw capabilities?

Similarly, are there other projects or vendors that are looking at xio at 
this point?  I've seen similar attempts to create this sort of library 
(CCI comes to mind: https://github.com/CCI/cci).  Have these previous 
attempts influenced the design of xio at all?

> It's early days, but the implementation has started to take shape, and
> gives a feel for what the Accelio architecture looks like when using the
> request-response model, as well as for our prototype mapping of the
> xio framework concepts to the Ceph ones.
> 
> The current classes and responsibility breakdown somewhat as follows.
> The key classes in the TCP messaging implementation are:
> 
> Messenger (abstract, represents a set of bidirectional communication endpoints)
> SimpleMessenger (concrete TCP messenger)
> 
> Message (abstract, models a message between endpoints, all Ceph protocol messages
> derive from Message, obviously)
> 
> Connection (concrete, though it -feels- abstract;  Connection models a communication
> endpoint identifiable by address, but has -some- coupling with the internals of
> SimpleMessenger, in particular, with its Pipe, below).
> 
> Pipe (concrete, an active (threaded) object that encapsulates various operations on
> one side (send or recv) of a TCP connection.  The Pipe is really where a -lot- of
> the heavy lifting of SimpleMessenger is localized, and not just in the obvious
> ways--eg, Pipe drives the dispatch queue in SimpleMessenger, so a lot of it's
> visible semantics are built in cooperation with Pipe).
> 
> Dispatcher (abstract, models the application processing messages and sending replies--ie, the upper edge of Messenger).
> 
> The approach I took in incorporating Accelio was to build on the key abstractions
> of Messenger, Connection, and Dispatcher, and Message, and build a corresponding
> family of concrete classes:

This sounds like the right approach.  And we definitely want to clean up 
the separation of the abstract interfaces (Message, Connection, Messenger) 
from the implementations.  I'm happy to pull that stuff into the tree 
quickly once the interfaces appear stable (although it looks like your 
branch is based off lots of other linuxbox bits, so it probably isn't 
important until this gets closer to ready).

Also, it would be great to build out the simple_* test endpoints as this 
effort progresses; hopefully that can eventually form the basis of a test 
suite for the messenger and can be expanded to include various stress 
tests that don't require a full running cluster.

> XioMessenger (concrete, implements Messenger, encapsulates xio endpoints, aggregates
> dispatchers as normal).
> 
> XioConnection (concrete, implements Connection)
> 
> XioPortal (concrete, a new class that represents worker thread contexts for all XioConnections in a given XioMessenger)
> 
> XioMsg (concrete, a "transfer" class linking a sequence of low-level Accelio datagrams with a Message being sent)
> 
> XioReplyHook (concrete, derived from Ceph::Context [indirectly via Message::ReplyHook], links a sequence of low-level Accelio datagrams for a Message that has been received-- that is, part of a new "reply" abstraction exposed to Message and Messenger).
> 
> As noted above, there is some leakage of SimpleMessenger primitives into classes that are intended to be abstract, and some refactoring was needed to fit XioMessenger into the framework.  The main changes I prototyped are as follows:
> 
> All traces of Pipe are removed from Connection, which is made abstract.  A new
> PipeConnection is introduced, that knows about Pipes.  SimpleMessenger now uses
> instances of PipeConnection as its concrete connection type.
> 
> The most interesting changes I introduced are driven by the need to support
> Accelio's request/response model, which exists mainly to support RDMA memory
> registration primitives, and needs a concrete realization in the Messenger
> framework.
> 
> To accomodate it, I've introduced two concepts.  First, callers replying 
> to a Message use a new Messenger::send_reply(Message *msg, Message 
> *reply) method.  In SimpleMessenger, this just maps to a call to 
> send_message(Message *, Connection*), but in XioMessenger, the reply is 
> delivered through a new Message::reply_hook completion functor that 
> XioConnection sets when a message is being dispatched.  This is a 
> general mechanism, new Messenger implementations can derive from 
> Message::ReplyHook to define their own reply behavior, as needed.

This worries me a bit; see Greg's questions.  There are several 
request/reply patterns, but many (most?) of the message exchanges are 
asymmetrical.  I wonder if the low-level request/reply model really maps 
more closely to the 'ack' stuff in SimpleMessenger (as it's about 
deallocating the sender's memory and cleaning up rdma state).

> A lot of low level details of the mapping from Message to Accelio 
> messaging are currently in flux, but the basic idea is to re-use the 
> current encode/decode primitives as far as possible, while eliding the 
> acks, sequence # and tids, and timestamp behaviors of Pipe, or rather, 
> replacing them with mappings to Accelio primitives.  I have some wrapper 
> classes that help with this.  For the moment, the existing Ceph message 
> headers and footers are still there, but are now encoded/decoded, rather 
> than hand-marshalled.  This means that checksumming is probably mostly 
> intact.  Message signatures are not implemented.
> 
> What works.  The current prototype isn't integrated with the main server daemons
> (e.g., OSD) but experimental work on that is in progress.  I've created a pair of
> simple standalone client/server applications simple_server/simple_client and
> a matching xio_server/xio_client, that provide a minimal message dispatch loop with
> a new SimpleDispatcher class and some other helpers, as a way to work with both
> messengers side-by-side.  These are currently very primitive, but will probably
> do more things soon.  The current prototype sends messages over Accelio, but has some issue
> with replies, that should be fixed shortly.  It leaks lots of memory, etc.
> 
> We've pushed a work-in-progress branch "xio-messenger" to our external github
> repository, for community review.  Find it here:
> 
> https://github.com/linuxbox2/linuxbox-ceph

Looking through this, it occurs to me that there are some other 
foundational pieces that we'll need to get in place soon:

 - The XioMessenger is a completely different wire protocol that needs to 
be distinct from the legacy protocol.  Probably we can use the 
entity_addr_t::type field for this.
 - We'll want the various *Map structures to allow multiple 
entity_addr_t's per entity.  We already could use this to support both 
IPv4 and IPv6.  In the future, though, we'll probably want clusters that 
can speak both the legacy TCP protocol (via SimpleMessenger or some 
improved implementation) and the xio one (and whatever else we dream up in 
the future).
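
To make the two points above concrete, here is a small sketch of an entity
advertising several typed addresses, with a peer picking the first one it can
speak.  Everything in it is an assumption for illustration: AddrType and its
values, EntityAddr, EntityAddrs and pick() are hypothetical and are not the
real entity_addr_t layout or any existing map API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical protocol tags carried in the address "type" field.
enum class AddrType : uint32_t { LegacyTcp = 0, Xio = 1 };

// Simplified address: protocol tag plus host and port.
struct EntityAddr {
  AddrType type;
  std::string host;
  uint16_t port;
};

// All addresses advertised for one entity (e.g. one OSD) in a map.
struct EntityAddrs {
  std::vector<EntityAddr> addrs;

  // Return the first advertised address whose type the caller supports,
  // trying the caller's types in its own preference order.
  const EntityAddr* pick(const std::vector<AddrType>& preferred) const {
    for (AddrType t : preferred)
      for (const EntityAddr& a : addrs)
        if (a.type == t) return &a;
    return nullptr;  // no protocol in common
  }
};
```

An xio-capable client would ask for {Xio, LegacyTcp} and fall back to TCP
against daemons that only advertise the legacy protocol.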

Also, as has been mentioned previously,

 - We need to continue to migrate stuff over to the Connection-based 
Messenger interface and off the original methods that take entity_inst_t.  
The sticky bit here is the peer-to-peer mode that is used inside the OSD 
and MDS clusters: those need to handle racing connection attempts, which 
either requires the internal entity name -> connection map to resolve 
races (as we have now) or a new approach that pushes the race-resolution 
up into the calling code (meh).  No need to address it now, but eventually 
we'll need to tackle it before this can be used on the osd back-side 
network.
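
For illustration, the "internal entity name -> connection map" option can be
sketched like this.  The tie-break rule used here (the attempt initiated by the
lexicographically smaller entity name wins) is an assumption chosen to keep the
example short; it is not the rule SimpleMessenger actually applies, and
ConnectionTable, Origin, offer() and current() are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Which side initiated a connection, from the local point of view.
enum class Origin { Outgoing, Incoming };

// Per-daemon table mapping peer entity name to the single surviving connection.
class ConnectionTable {
public:
  explicit ConnectionTable(std::string self) : self_name(std::move(self)) {}

  // Register a connection attempt.  Returns true if it becomes the
  // connection for `peer`, false if an existing connection is kept instead.
  bool offer(const std::string& peer, Origin origin) {
    auto it = conns.find(peer);
    if (it == conns.end()) {
      conns.emplace(peer, origin);
      return true;
    }
    if (it->second == origin) return false;       // same direction: keep existing
    if (origin != preferred(peer)) return false;  // racing attempt loses
    it->second = origin;                          // racing attempt wins: replace
    return true;
  }

  Origin current(const std::string& peer) const { return conns.at(peer); }

private:
  // Both ends evaluate the same rule, so they agree on which attempt survives.
  Origin preferred(const std::string& peer) const {
    return self_name < peer ? Origin::Outgoing : Origin::Incoming;
  }

  std::string self_name;
  std::map<std::string, Origin> conns;
};
```

Because the rule is symmetric, two daemons that connect to each other at the
same time both end up keeping the same one of the two attempts.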

sage


* Re: Ceph Messaging on Accelio (libxio) RDMA
  2013-12-11 22:58 ` Gregory Farnum
@ 2013-12-12  1:13   ` Matt W. Benjamin
  2013-12-18 22:13     ` Gregory Farnum
  0 siblings, 1 reply; 17+ messages in thread
From: Matt W. Benjamin @ 2013-12-12  1:13 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Sage Weil, Yaron Haviv, Eyal Salomon

Hi Greg,

I haven't settled on the decision to reify replies in the Messenger at this
point, but it is what the current prototype code tries to do.

The request/response model is more general than my language implied, and
also is not the only one available.  However, it is the richest model in
Accelio, and I'm currently exploring how best to exploit it.

The most general model available sends one-way messages in both directions,
and obviously looks the most like the current Messenger model.  Under the covers,
both Accelio models are built on the same primitives.  The one-way model
is not incompatible with zero-copy RDMA operations, though I believe it's
at least trivially true that the one-way model uses only send/recv and
read operations.  Behind the scenes, the underlying Accelio framework
requires, more or less, an exchange of state between the endpoints to maintain
a balance of RDMA resources in each direction, and to complete RDMA read
and write transactions (which use registered memory at the sender/receiver,
respectively).  This isn't of course something the Messenger consumer needs
to be aware of, except precisely so as to permit the framework to know when
the upper layer operations on registered memory have completed.

As for the higher level semantics, the first thing to note is that all the
Accelio primitives provide for delivery receipts, and one of my goals is to
unify Message acks completely with receipts.  A second key point is that the
primary property of the current reply hook is not its ability to reply, but
its completion semantics, and these can be articulated on any of the Accelio
models.  It's possible that's all that's desired.
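
The "completion semantics" point can be illustrated with a small sketch: a
hook attached to a received message tells the transport when the upper layer
is done with the buffers backing it.  This is an assumption-laden toy, not the
prototype's XioReplyHook: RegisteredPool and RecvCompletionHook are
hypothetical, buffers are just a counter, and the Context here is a simplified
stand-in that does not delete itself on completion.

```cpp
#include <cassert>

// Simplified completion callback base class.
struct Context {
  virtual ~Context() = default;
  void complete(int r) { finish(r); }

protected:
  virtual void finish(int r) = 0;
};

// Hypothetical stand-in for a pool of registered receive buffers.
struct RegisteredPool {
  int in_use = 0;
  void acquire(int n) { in_use += n; }
  void release(int n) { in_use -= n; }
};

// Attached to a received message; holds its buffers until the upper layer
// signals completion, whether or not any reply message is ever sent.
class RecvCompletionHook : public Context {
public:
  RecvCompletionHook(RegisteredPool& p, int n) : pool(p), nbufs(n) {
    pool.acquire(nbufs);
  }

protected:
  void finish(int) override {
    if (done) return;  // completing twice must not release twice
    pool.release(nbufs);
    done = true;
  }

private:
  RegisteredPool& pool;
  int nbufs;
  bool done = false;
};
```

Nothing in this depends on a reply being sent, which is why the same idea
applies to one-way messages as well as to request/response.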

What I'm still exploring is whether the request/response model provides additional
value to the caller that one-way would not.  The third available model
would use xio response messages to deliver any message available at the
responder, so perhaps permitting greater application utilization of the
underlying resources in some circumstances.  I think a lot of this will be
clearer as I connect the XioMessenger to Ceph callers.  As we've worked on the
prototype we've found a number of places where we could tweak the Accelio APIs
to good effect, and I think we'll find more places as we continue work.

Matt



* Re: Ceph Messaging on Accelio (libxio) RDMA
  2013-12-12  0:59 ` Sage Weil
@ 2013-12-12  1:33   ` Matt W. Benjamin
  2013-12-12 11:32     ` Yaron Haviv
  2014-01-06 15:55     ` Atchley, Scott
  2013-12-12  2:14   ` Mark Nelson
  1 sibling, 2 replies; 17+ messages in thread
From: Matt W. Benjamin @ 2013-12-12  1:33 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Yaron Haviv, Eyal Salomon

Hi Sage,

inline

----- "Sage Weil" <sage@inktank.com> wrote:

> Hi Matt,
> 
> Thanks for posting this!  Some comments and questions below.
> 
> 
> I was originally thinking that xio was going to be more mellanox-specific,
> but it looks like it runs over multiple transports (even tcp!).  (I'm sure
> I've been told this before but it apparently didn't sink in.)  Is there
> also a mellanox-specific backend (that is not ibverbs) that takes any
> special advantage of mellanox hw capabilities?

The actual situation is that xio is currently ibverbs specific, though
there is interest with Mellanox and some partners in building a TCP
transport for it.

What is true is that xio makes very advanced use of ibverbs interfaces,
lock free/wait-free allocators, rdtsc, but hides a lot of details from
upper layers.  The xio designers knew how to get the most from infiniband/
RDMA, and it shows.

Also, ibverbs is a first-class interface to iWARP and especially
RoCE hardware, as well as IB.  I've been doing most of my development on
a tweaked version of the softiwarp ib provider, which amounts to a full
RDMA simulator that runs on anything.  (Apparently it can run over TCP,
but I just use it on one vm host.)

I haven't worked with cci, but just glancing at it, I'd say xio stacks
up very well on ibverbs, but won't solve the TCP transport problem
immediately.

> 
> Similarly, are there other projects or vendors that are looking at xio at
> this point?

Mellanox partners are working with it mainly, I believe.

> I've seen similar attempts to create this sort of library
> (CCI comes to mind: https://github.com/CCI/cci).  Have these previous
> attempts influenced the design of xio at all?

> >
> > The approach I took in incorporating Accelio was to build on the key abstractions
> > of Messenger, Connection, and Dispatcher, and Message, and build a corresponding
> > family of concrete classes:
>
> This sounds like the right approach.  And we definitely want to clean up
> the separation of the abstract interfaces (Message, Connection, Messenger)
> from the implementations.  I'm happy to pull that stuff into the tree
> quickly once the interfaces appear stable (although it looks like your
> branch is based off lots of other linuxbox bits, so it probably isn't
> important until this gets closer to ready).

Ok, cool.

> 
> Also, it would be great to build out the simple_* test endpoints as this
> effort progresses; hopefully that can eventually form the basis of a test
> suite for the messenger and can be expanded to include various stress
> tests that don't require a full running cluster.

I agree.  I intend to have it at least running more Message types
RSN.

> 
> > XioMessenger (concrete, implements Messenger, encapsulates xio endpoints, aggregates

Agreed, I respond to this point in more detail in my reply to Greg's
message.

> 
> This worries me a bit; see Greg's questions.  There are several
> request/reply patterns, but many (most?) of the message exchanges are
> asymmetrical.  I wonder if the low-level request/reply model really maps
> more closely the 'ack' stuff in SimpleMessenger (as it's about
> deallocating the sender's memory and cleaning up rdma state).
> 
> > A lot of low level details of the mapping from Message to Accelio 
> > messaging are currently in flux, but the basic idea is to re-use the
> 
> > current encode/decode primitives as far as possible, while eliding
> the 
> > acks, sequence # and tids, and timestamp behaviors of Pipe, or
> rather, 
> > replacing them with mappings to Accelio primitives.  I have some
> wrapper 
> > classes that help with this.  For the moment, the existing Ceph
> message 
> > headers and footers are still there, but are now encoded/decoded,
> rather 
> > than hand-marshalled.  This means that checksumming is probably
> mostly 
> > intact.  Message signatures are not implemented.
> > 
> > What works.  The current prototype isn't integrated with the main
> server daemons
> > (e.g., OSD) but experimental work on that is in progress.  I've
> created a pair of
> > simple standalone client/server applications
> simple_server/simple_client and
> > a matching xio_server/xio_client, that provide a minimal message
> dispatch loop with
> > a new SimpleDispatcher class and some other helpers, as a way to
> work with both
> > messengers side-by-side.  These are currently very primitive, but
> will probably
> > do more things soon.  The current prototype sends messages over
> Accelio, but has some issue
> > with replies, that should be fixed shortly.  It leaks lots of
> memory, etc.
> > 
> > We've pushed a work-in-progress branch "xio-messenger" to our
> external github
> > repository, for community review.  Find it here:
> > 
> > https://github.com/linuxbox2/linuxbox-ceph
> 
> Looking through this, it occurs to me that there are some other
> foundational pieces that we'll need to get in place soon:
> 
>  - The XioMessenger is a completely different wire protocol that needs to
> be distinct from the legacy protocol.  Probably we can use the
> entity_addr_t::type field for this.
>  - We'll want the various *Map structures to allow multiple
> entity_addr_t's per entity.  We already could use this to support both
> IPv4 and IPv6.  In the future, though, we'll probably want clusters that
> can speak both the legacy TCP protocol (via SimpleMessenger or some
> improved implementation) and the xio one (and whatever else we dream up in
> the future).

Ack.
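
To make the two points above a bit more concrete (a per-address protocol
tag, plus several addresses per entity), here is a minimal sketch.  It is
only an illustration under assumptions: AddrType, EntityAddr, EntityAddrs
and pick_addr are hypothetical names, not the existing entity_addr_t layout,
and no actual type values have been agreed on.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical protocol tags; real entity_addr_t::type values are undecided.
enum class AddrType : std::uint32_t { LegacyTcp = 0, Xio = 1 };

// Simplified stand-in for entity_addr_t: a protocol tag plus an endpoint.
struct EntityAddr {
  AddrType type;
  std::string endpoint;
};

// Simplified stand-in for a map entry listing every address of one entity.
struct EntityAddrs {
  std::vector<EntityAddr> addrs;
};

// Return the first address matching one of the caller's protocols, tried in
// preference order, or nullptr if the entity speaks none of them.
inline const EntityAddr* pick_addr(const EntityAddrs& entity,
                                   const std::vector<AddrType>& preferred) {
  for (AddrType want : preferred) {
    for (const EntityAddr& a : entity.addrs) {
      if (a.type == want) return &a;
    }
  }
  return nullptr;
}
```

A client that speaks both protocols would pass {Xio, LegacyTcp} and fall
back to TCP for entities that only advertise a legacy address.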

> 
> Also, as has been mentioned previously,
> 
>  - We need to continue to migrate stuff over to the Connection-based
> Messenger interface and off the original methods that take entity_inst_t.
> The sticky bit here is the peer-to-peer mode that is used inside the OSD
> and MDS clusters: those need to handle racing connection attempts, which
> either requires the internal entity name -> connection map to resolve
> races (as we have now) or a new approach that pushes the race-resolution
> up into the calling code (meh).  No need to address it now, but eventually
> we'll need to tackle it before this can be used on the osd back-side
> network.
> 
> sage
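
For the racing-connection case, the "internal entity name -> connection
map" option can be pictured roughly as below.  This is a sketch under
assumptions, not SimpleMessenger's actual logic: ConnTable, resolve_race and
the plain string comparison of addresses are all hypothetical, chosen only
so that both peers reach the same verdict independently.

```cpp
#include <cassert>
#include <map>
#include <string>

enum class Winner { Existing, Incoming };

// Illustrative tie-break only: both peers evaluate the same ordering on the
// two addresses, so they agree on which connection attempt survives.
inline Winner resolve_race(const std::string& my_addr,
                           const std::string& peer_addr) {
  return (peer_addr < my_addr) ? Winner::Incoming : Winner::Existing;
}

// Hypothetical entity name -> connection state table inside a messenger.
struct ConnTable {
  enum class State { Connecting, Open };
  std::string my_addr;
  std::map<std::string, State> by_peer;

  // Record that we started an outgoing connection attempt to peer_name.
  void start_connect(const std::string& peer_name) {
    by_peer.emplace(peer_name, State::Connecting);
  }

  // Handle an incoming connection; returns which side's attempt is kept.
  Winner on_accept(const std::string& peer_name,
                   const std::string& peer_addr) {
    auto it = by_peer.find(peer_name);
    if (it == by_peer.end()) {  // no competing attempt: just accept
      by_peer[peer_name] = State::Open;
      return Winner::Incoming;
    }
    if (it->second == State::Open) {
      return Winner::Existing;  // simplification: keep the open connection
    }
    Winner w = resolve_race(my_addr, peer_addr);
    if (w == Winner::Incoming) {
      it->second = State::Open;  // drop our attempt, keep the peer's
    }
    return w;  // on Existing, the peer is expected to back off
  }
};
```

With this rule the attempt started by the lower-addressed peer survives on
both sides, so the calling code never sees the race.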

-- 
Matt Benjamin
CohortFS, LLC.
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://cohortfs.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Ceph Messaging on Accelio (libxio) RDMA
  2013-12-12  0:59 ` Sage Weil
  2013-12-12  1:33   ` Matt W. Benjamin
@ 2013-12-12  2:14   ` Mark Nelson
  1 sibling, 0 replies; 17+ messages in thread
From: Mark Nelson @ 2013-12-12  2:14 UTC (permalink / raw)
  To: Sage Weil; +Cc: Matt W. Benjamin, ceph-devel, Yaron Haviv, Eyal Salomon

On 12/11/2013 06:59 PM, Sage Weil wrote:
> Hi Matt,
>
> Thanks for posting this!  Some comments and questions below.
>
> On Wed, 11 Dec 2013, Matt W. Benjamin wrote:
>> Hi Ceph devs,
>>
>> For the last several weeks, we've been working with engineers at
>> Mellanox on a prototype Ceph messaging implementation that runs on
>> the Accelio RDMA messaging service (libxio).
>>
>> Accelio is a rather new effort to build a high-performance, high-throughput
>> message passing framework atop openfabrics ibverbs and rdmacm primitives.
>
> I was originally thinking that xio was going to be more mellanox-specific,
> but it looks like it runs over multiple transports (even tcp!).  (I'm sure
> I've been told this before but it apparently didn't sink in.)  Is there
> also a mellanox-specific backend (that is not ibverbs) that takes any
> special advantage of mellanox hw capabilities?
>
> Similarly, are there other projects or vendors that are looking at xio at
> this point?  I've seen similar attempts to create this sort of library
> (CCI comes to mind: https://github.com/CCI/cci).  Have these previous
> attempts influenced the design of xio at all?
>
>> It's early days, but the implementation has started to take shape, and
>> gives a feel for what the Accelio architecture looks like when using the
>> request-response model, as well as for our prototype mapping of the
>> xio framework concepts to the Ceph ones.
>>
>> The current classes and responsibility breakdown somewhat as follows.
>> The key classes in the TCP messaging implementation are:
>>
>> Messenger (abstract, represents a set of bidirectional communication endpoints)
>> SimpleMessenger (concrete TCP messenger)
>>
>> Message (abstract, models a message between endpoints, all Ceph protocol messages
>> derive from Message, obviously)
>>
>> Connection (concrete, though it -feels- abstract;  Connection models a communication
>> endpoint identifiable by address, but has -some- coupling with the internals of
>> SimpleMessenger, in particular, with its Pipe, below).
>>
>> Pipe (concrete, an active (threaded) object that encapsulates various operations on
>> one side (send or recv) of a TCP connection.  The Pipe is really where a -lot- of
>> the heavy lifting of SimpleMessenger is localized, and not just in the obvious
>> ways--eg, Pipe drives the dispatch queue in SimpleMessenger, so a lot of its
>> visible semantics are built in cooperation with Pipe).
>>
>> Dispatcher (abstract, models the application processing messages and sending replies--ie, the upper edge of Messenger).
>>
>> The approach I took in incorporating Accelio was to build on the key abstractions
>> of Messenger, Connection, and Dispatcher, and Message, and build a corresponding
>> family of concrete classes:
>
> This sounds like the right approach.  And we definitely want to clean up
> the separation of the abstract interfaces (Message, Connection, Messenger)
> from the implementations.  I'm happy to pull that stuff into the tree
> quickly once the interfaces appear stable (although it looks like your
> branch is based off lots of other linuxbox bits, so it probably isn't
> important until this gets closer to ready).
>
> Also, it would be great to build out the simple_* test endpoints as this
> effort progresses; hopefully that can eventually form the basis of a test
> suite for the messenger and can be expanded to include various stress
> tests that don't require a full running cluster.

Yes please!  One of the things I've desperately wanted is to be able to 
do blackbox testing at different points in the data path.  We've got 
smalliobench, but independently testing the client, messenger, radosgw, 
etc would all be incredibly useful.

>
>> XioMessenger (concrete, implements Messenger, encapsulates xio endpoints, aggregates
>> dispatchers as normal).
>>
>> XioConnection (concrete, implements Connection)
>>
>> XioPortal (concrete, a new class that represents worker thread contexts for all XioConnections in a given XioMessenger)
>>
>> XioMsg (concrete, a "transfer" class linking a sequence of low-level Accelio datagrams with a Message being sent)
>>
>> XioReplyHook (concrete, derived from Ceph::Context [indirectly via Message::ReplyHook], links a sequence of low-level Accelio datagrams for a Message that has been received-- that is, part of a new "reply" abstraction exposed to Message and Messenger).
>>
>> As noted above, there is some leakage of SimpleMessenger primitives into classes that are intended to be abstract, and some refactoring was needed to fit XioMessenger into the framework.  The main changes I prototyped are as follows:
>>
>> All traces of Pipe are removed from Connection, which is made abstract.  A new
>> PipeConnection is introduced, that knows about Pipes.  SimpleMessenger now uses
>> instances of PipeConnection as its concrete connection type.
>>
>> The most interesting changes I introduced are driven by the need to support
>> Accelio's request/response model, which exists mainly to support RDMA memory
>> registration primitives, and needs a concrete realization in the Messenger
>> framework.
>>
>> To accommodate it, I've introduced two concepts.  First, callers replying
>> to a Message use a new Messenger::send_reply(Message *msg, Message
>> *reply) method.  In SimpleMessenger, this just maps to a call to
>> send_message(Message *, Connection*), but in XioMessenger, the reply is
>> delivered through a new Message::reply_hook completion functor that
>> XioConnection sets when a message is being dispatched.  This is a
>> general mechanism, new Messenger implementations can derive from
>> Message::ReplyHook to define their own reply behavior, as needed.
>
> This worries me a bit; see Greg's questions.  There are several
> request/reply patterns, but many (most?) of the message exchanges are
> asymmetrical.  I wonder if the low-level request/reply model really maps
> more closely the 'ack' stuff in SimpleMessenger (as it's about
> deallocating the sender's memory and cleaning up rdma state).
>
>> A lot of low level details of the mapping from Message to Accelio
>> messaging are currently in flux, but the basic idea is to re-use the
>> current encode/decode primitives as far as possible, while eliding the
>> acks, sequence # and tids, and timestamp behaviors of Pipe, or rather,
>> replacing them with mappings to Accelio primitives.  I have some wrapper
>> classes that help with this.  For the moment, the existing Ceph message
>> headers and footers are still there, but are now encoded/decoded, rather
>> than hand-marshalled.  This means that checksumming is probably mostly
>> intact.  Message signatures are not implemented.
>>
>> What works.  The current prototype isn't integrated with the main server daemons
>> (e.g., OSD) but experimental work on that is in progress.  I've created a pair of
>> simple standalone client/server applications simple_server/simple_client and
>> a matching xio_server/xio_client, that provide a minimal message dispatch loop with
>> a new SimpleDispatcher class and some other helpers, as a way to work with both
>> messengers side-by-side.  These are currently very primitive, but will probably
>> do more things soon.  The current prototype sends messages over Accelio, but has some issue
>> with replies, that should be fixed shortly.  It leaks lots of memory, etc.
>>
>> We've pushed a work-in-progress branch "xio-messenger" to our external github
>> repository, for community review.  Find it here:
>>
>> https://github.com/linuxbox2/linuxbox-ceph
>
> Looking through this, it occurs to me that there are some other
> foundational pieces that we'll need to get in place soon:
>
>   - The XioMessenger is a completely different wire protocol that needs to
> be distinct from the legacy protocol.  Probably we can use the
> entity_addr_t::type field for this.
>   - We'll want the various *Map structures to allow multiple
> entity_addr_t's per entity.  We already could use this to support both
> IPv4 and IPv6.  In the future, though, we'll probably want clusters that
> can speak both the legacy TCP protocol (via SimpleMessenger or some
> improved implementation) and the xio one (and whatever else we dream up in
> the future).
>
> Also, as has been mentioned previously,
>
>   - We need to continue to migrate stuff over to the Connection-based
> Messenger interface and off the original methods that take entity_inst_t.
> The sticky bit here is the peer-to-peer mode that is used inside the OSD
> and MDS clusters: those need to handle racing connection attempts, which
> either requires the internal entity name -> connection map to resolve
> races (as we have now) or a new approach that pushes the race-resolution
> up into the calling code (meh).  No need to address it now, but eventually
> we'll need to tackle it before this can be used on the osd back-side
> network.
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
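
For anyone following along, the send_reply / reply_hook mechanism quoted
above can be pictured roughly as follows.  This is a minimal sketch under
assumptions, not the code in the xio-messenger branch: the classes are
stripped-down stand-ins, RecordingHook is a hypothetical placeholder for
XioReplyHook, and the per-messenger behavior is folded into one base-class
fallback for brevity.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Message;

// Simplified stand-in for Message::ReplyHook: a completion functor that a
// transport attaches to a message while it is being dispatched.
struct ReplyHook {
  virtual ~ReplyHook() = default;
  virtual void deliver(Message* reply) = 0;
};

// Simplified Connection that just records the payloads "sent" on it.
struct Connection {
  std::vector<std::string> sent;
};

struct Message {
  std::string payload;
  Connection* conn = nullptr;
  std::unique_ptr<ReplyHook> reply_hook;  // unset for TCP-style transports
  explicit Message(std::string p) : payload(std::move(p)) {}
};

struct Messenger {
  virtual ~Messenger() = default;
  virtual void send_message(Message* m, Connection* c) {
    c->sent.push_back(m->payload);
  }
  // Reply to msg: go through the hook when the transport installed one,
  // otherwise fall back to an ordinary send on the request's connection.
  void send_reply(Message* msg, Message* reply) {
    if (msg->reply_hook) {
      msg->reply_hook->deliver(reply);
    } else {
      send_message(reply, msg->conn);
    }
  }
};

// Hypothetical hook standing in for XioReplyHook: it ties the reply to the
// request it answers (here by recording it in a caller-owned log).
struct RecordingHook : ReplyHook {
  std::vector<std::string>& log;
  explicit RecordingHook(std::vector<std::string>& l) : log(l) {}
  void deliver(Message* reply) override { log.push_back(reply->payload); }
};
```

A dispatcher only ever calls send_reply(); whether the reply travels as a
plain send or as the response half of a transport-level request is decided
by whether a hook was attached at dispatch time.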


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Ceph Messaging on Accelio (libxio) RDMA
  2013-12-11 22:32 Ceph Messaging on Accelio (libxio) RDMA Matt W. Benjamin
  2013-12-11 22:58 ` Gregory Farnum
  2013-12-12  0:59 ` Sage Weil
@ 2013-12-12 10:19 ` Kasper Dieter
  2013-12-12 10:43   ` Yaron Haviv
  2 siblings, 1 reply; 17+ messages in thread
From: Kasper Dieter @ 2013-12-12 10:19 UTC (permalink / raw)
  To: Matt W. Benjamin; +Cc: ceph-devel, Sage Weil, Yaron Haviv, Eyal Salomon

Hi Matt,

How will you support the kernel clients (via libceph.ko) with
Accelio/libxio?

Best Regards,
-Dieter

On Wed, Dec 11, 2013 at 11:32:28PM +0100, Matt W. Benjamin wrote:
> Hi Ceph devs,
> 
> For the last several weeks, we've been working with engineers at
> Mellanox on a prototype Ceph messaging implementation that runs on
> the Accelio RDMA messaging service (libxio).
> 
> Accelio is a rather new effort to build a high-performance, high-throughput
> message passing framework atop openfabrics ibverbs and rdmacm primitives.
> 
> It's early days, but the implementation has started to take shape, and
> gives a feel for what the Accelio architecture looks like when using the
> request-response model, as well as for our prototype mapping of the
> xio framework concepts to the Ceph ones.
> 
> The current classes and responsibility breakdown somewhat as follows.
> The key classes in the TCP messaging implementation are:
> 
> Messenger (abstract, represents a set of bidirectional communication endpoints)
> SimpleMessenger (concrete TCP messenger)
> 
> Message (abstract, models a message between endpoints, all Ceph protocol messages
> derive from Message, obviously)
> 
> Connection (concrete, though it -feels- abstract;  Connection models a communication
> endpoint identifiable by address, but has -some- coupling with the internals of
> SimpleMessenger, in particular, with its Pipe, below).
> 
> Pipe (concrete, an active (threaded) object that encapsulates various operations on
> one side (send or recv) of a TCP connection.  The Pipe is really where a -lot- of
> the heavy lifting of SimpleMessenger is localized, and not just in the obvious
> ways--eg, Pipe drives the dispatch queue in SimpleMessenger, so a lot of its
> visible semantics are built in cooperation with Pipe).
> 
> Dispatcher (abstract, models the application processing messages and sending replies--ie, the upper edge of Messenger).
> 
> The approach I took in incorporating Accelio was to build on the key abstractions
> of Messenger, Connection, and Dispatcher, and Message, and build a corresponding
> family of concrete classes:
> 
> XioMessenger (concrete, implements Messenger, encapsulates xio endpoints, aggregates
> dispatchers as normal).
> 
> XioConnection (concrete, implements Connection)
> 
> XioPortal (concrete, a new class that represents worker thread contexts for all XioConnections in a given XioMessenger)
> 
> XioMsg (concrete, a "transfer" class linking a sequence of low-level Accelio datagrams with a Message being sent)
> 
> XioReplyHook (concrete, derived from Ceph::Context [indirectly via Message::ReplyHook], links a sequence of low-level Accelio datagrams for a Message that has been received-- that is, part of a new "reply" abstraction exposed to Message and Messenger).
> 
> As noted above, there is some leakage of SimpleMessenger primitives into classes that are intended to be abstract, and some refactoring was needed to fit XioMessenger into the framework.  The main changes I prototyped are as follows:
> 
> All traces of Pipe are removed from Connection, which is made abstract.  A new
> PipeConnection is introduced, that knows about Pipes.  SimpleMessenger now uses
> instances of PipeConnection as its concrete connection type.
> 
> The most interesting changes I introduced are driven by the need to support
> Accelio's request/response model, which exists mainly to support RDMA memory
> registration primitives, and needs a concrete realization in the Messenger
> framework.
> 
> To accommodate it, I've introduced two concepts.  First, callers replying
> to a Message use a new Messenger::send_reply(Message *msg, Message
> *reply) method.  In SimpleMessenger, this just maps to a call to
> send_message(Message *, Connection*), but in XioMessenger, the reply is
> delivered through a new Message::reply_hook completion functor that
> XioConnection sets when a message is being dispatched.  This is a
> general mechanism, new Messenger implementations can derive from
> Message::ReplyHook to define their own reply behavior, as needed.
> 
> A lot of low level details of the mapping from Message to Accelio
> messaging are currently in flux, but the basic idea is to re-use the
> current encode/decode primitives as far as possible, while eliding the
> acks, sequence # and tids, and timestamp behaviors of Pipe, or rather,
> replacing them with mappings to Accelio primitives.  I have some wrapper
> classes that help with this.  For the moment, the existing Ceph message
> headers and footers are still there, but are now encoded/decoded, rather
> than hand-marshalled.  This means that checksumming is probably mostly
> intact.  Message signatures are not implemented.
> 
> What works.  The current prototype isn't integrated with the main server daemons
> (e.g., OSD) but experimental work on that is in progress.  I've created a pair of
> simple standalone client/server applications simple_server/simple_client and
> a matching xio_server/xio_client, that provide a minimal message dispatch loop with
> a new SimpleDispatcher class and some other helpers, as a way to work with both
> messengers side-by-side.  These are currently very primitive, but will probably
> do more things soon.  The current prototype sends messages over Accelio, but has some issue
> with replies, that should be fixed shortly.  It leaks lots of memory, etc.
> 
> We've pushed a work-in-progress branch "xio-messenger" to our external github
> repository, for community review.  Find it here:
> 
> https://github.com/linuxbox2/linuxbox-ceph
> 
> Thanks!
> 
> Matt
> 
> -- 
> Matt Benjamin
> CohortFS, LLC.
> 206 South Fifth Ave. Suite 150
> Ann Arbor, MI  48104
> 
> http://cohortfs.com
> 
> tel.  734-761-4689 
> fax.  734-769-8938 
> cel.  734-216-5309 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Ceph Messaging on Accelio (libxio) RDMA
  2013-12-12 10:19 ` Kasper Dieter
@ 2013-12-12 10:43   ` Yaron Haviv
  0 siblings, 0 replies; 17+ messages in thread
From: Yaron Haviv @ 2013-12-12 10:43 UTC (permalink / raw)
  To: Kasper Dieter, Matt W. Benjamin; +Cc: ceph-devel, Sage Weil, Eyal Salomon

> -----Original Message-----
> From: Kasper Dieter [mailto:dieter.kasper@ts.fujitsu.com]
> Sent: Thursday, December 12, 2013 12:19 PM
> To: Matt W. Benjamin
> Cc: ceph-devel; Sage Weil; Yaron Haviv; Eyal Salomon
> Subject: Re: Ceph Messaging on Accelio (libxio) RDMA
> 
> Hi Matt,
> 
> how will you solve the support of the kernel clients through libceph.ko with
> Accelio/libxio ?
[YH> ] 
Dieter, hi, there is also an early kAccelio version, and we plan to
address the kernel Ceph client later on.

Note that there is growing momentum for using Accelio in different
projects: for example, there is an early HDFS version over JXIO (the
Accelio Java bindings), and several leading storage and database vendors
are adopting it as their clustering middleware.  We plan to add TCP and
shared-memory transports next year so users can enjoy the same
architectural benefits (high speed, lock-free async message/RPC API,
zero-copy, multi-path, ...) over a variety of transports.  Ceph can
leverage this as well, so we won't need a transport switch in the upper
layers.

Yaron

> 
> Best Regards,
> -Dieter
> 
> On Wed, Dec 11, 2013 at 11:32:28PM +0100, Matt W. Benjamin wrote:
> > Hi Ceph devs,
> >
> > For the last several weeks, we've been working with engineers at
> > Mellanox on a prototype Ceph messaging implementation that runs on the
> > Accelio RDMA messaging service (libxio).
> >
> > Accelio is a rather new effort to build a high-performance,
> > high-throughput message passing framework atop openfabrics ibverbs and
> rdmacm primitives.
> >
> > It's early days, but the implementation has started to take shape, and
> > gives a feel for what the Accelio architecture looks like when using
> > the request-response model, as well as for our prototype mapping of
> > the xio framework concepts to the Ceph ones.
> >
> > The current classes and responsibility breakdown somewhat as follows.
> > The key classes in the TCP messaging implementation are:
> >
> > Messenger (abstract, represents a set of bidirectional communication
> > endpoints) SimpleMessenger (concrete TCP messenger)
> >
> > Message (abstract, models a message between endpoints, all Ceph
> > protocol messages derive from Message, obviously)
> >
> > Connection (concrete, though it -feels- abstract;  Connection models a
> > communication endpoint identifiable by address, but has -some-
> > coupling with the internals of SimpleMessenger, in particular, with its Pipe,
> below).
> >
> > Pipe (concrete, an active (threaded) object that encapsulates various
> > operations on one side (send or recv) of a TCP connection.  The Pipe
> > is really where a -lot- of the heavy lifting of SimpleMessenger is
> > localized, and not just in the obvious ways--eg, Pipe drives the
> > dispatch queue in SimpleMessenger, so a lot of its visible semantics are
> > built in cooperation with Pipe).
> >
> > Dispatcher (abstract, models the application processing messages and
> sending replies--ie, the upper edge of Messenger).
> >
> > The approach I took in incorporating Accelio was to build on the key
> > abstractions of Messenger, Connection, and Dispatcher, and Message,
> > and build a corresponding family of concrete classes:
> >
> > XioMessenger (concrete, implements Messenger, encapsulates xio
> > endpoints, aggregates dispatchers as normal).
> >
> > XioConnection (concrete, implements Connection)
> >
> > XioPortal (concrete, a new class that represents worker thread
> > contexts for all XioConnections in a given XioMessenger)
> >
> > XioMsg (concrete, a "transfer" class linking a sequence of low-level
> > Accelio datagrams with a Message being sent)
> >
> > XioReplyHook (concrete, derived from Ceph::Context [indirectly via
> Message::ReplyHook], links a sequence of low-level Accelio datagrams for a
> Message that has been received-- that is, part of a new "reply" abstraction
> exposed to Message and Messenger).
> >
> > As noted above, there is some leakage of SimpleMessenger primitives into
> classes that are intended to be abstract, and some refactoring was needed to
> fit XioMessenger into the framework.  The main changes I prototyped are as
> follows:
> >
> > All traces of Pipe are removed from Connection, which is made
> > abstract.  A new PipeConnection is introduced, that knows about Pipes.
> > SimpleMessenger now uses instances of PipeConnection as its concrete
> connection type.
> >
> > The most interesting changes I introduced are driven by the need to
> > support Accelio's request/response model, which exists mainly to
> > support RDMA memory registration primitives, and needs a concrete
> > realization in the Messenger framework.
> >
> > To accommodate it, I've introduced two concepts.  First, callers replying
> > to a Message use a new Messenger::send_reply(Message *msg, Message
> > *reply) method.  In SimpleMessenger, this just maps to a call to
> > send_message(Message *, Connection*), but in XioMessenger, the reply is
> > delivered through a new Message::reply_hook completion functor that
> > XioConnection sets when a message is being dispatched.  This is a
> > general mechanism, new Messenger implementations can derive from
> > Message::ReplyHook to define their own reply behavior, as needed.
> >
> > A lot of low level details of the mapping from Message to Accelio
> > messaging are currently in flux, but the basic idea is to re-use the
> > current encode/decode primitives as far as possible, while eliding the acks,
> sequence # and tids, and timestamp behaviors of Pipe, or rather, replacing
> them with mappings to Accelio primitives.  I have some wrapper classes that
> help with this.  For the moment, the existing Ceph message headers and
> footers are still there, but are now encoded/decoded, rather than hand-
> marshalled.  This means that checksumming is probably mostly intact.
> Message signatures are not implemented.
> >
> > What works.  The current prototype isn't integrated with the main
> > server daemons (e.g., OSD) but experimental work on that is in
> > progress.  I've created a pair of simple standalone client/server
> > applications simple_server/simple_client and a matching
> > xio_server/xio_client, that provide a minimal message dispatch loop
> > with a new SimpleDispatcher class and some other helpers, as a way to
> > work with both messengers side-by-side.  These are currently very
> > primitive, but will probably do more things soon.  The current prototype
> sends messages over Accelio, but has some issue with replies, that should be
> fixed shortly.  It leaks lots of memory, etc.
> >
> > We've pushed a work-in-progress branch "xio-messenger" to our external
> > github repository, for community review.  Find it here:
> >
> > https://github.com/linuxbox2/linuxbox-ceph
> >
> > Thanks!
> >
> > Matt
> >
> > --
> > Matt Benjamin
> > CohortFS, LLC.
> > 206 South Fifth Ave. Suite 150
> > Ann Arbor, MI  48104
> >
> > http://cohortfs.com
> >
> > tel.  734-761-4689
> > fax.  734-769-8938
> > cel.  734-216-5309

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Ceph Messaging on Accelio (libxio) RDMA
  2013-12-12  1:33   ` Matt W. Benjamin
@ 2013-12-12 11:32     ` Yaron Haviv
  2014-01-06 15:55     ` Atchley, Scott
  1 sibling, 0 replies; 17+ messages in thread
From: Yaron Haviv @ 2013-12-12 11:32 UTC (permalink / raw)
  To: Matt W. Benjamin, Sage Weil; +Cc: ceph-devel, Eyal Salomon

To shed more light on Accelio, see some notes below.

Yaron

> -----Original Message-----
> From: Matt W. Benjamin [mailto:matt@cohortfs.com]
> Sent: Thursday, December 12, 2013 3:34 AM
> To: Sage Weil
> Cc: ceph-devel; Yaron Haviv; Eyal Salomon
> Subject: Re: Ceph Messaging on Accelio (libxio) RDMA
> 
> Hi Sage,
> 
> inline
> 
> ----- "Sage Weil" <sage@inktank.com> wrote:
> 
> > Hi Matt,
> >
> > Thanks for posting this!  Some comments and questions below.
> >
> >
> > I was originally thinking that xio was going to be more
> > mellanox-specific, but it looks like it runs over multiple transports
> > (even tcp!).  (I'm sure I've been told this before but it apparently
> > didn't sink in.)  Is there also a mellanox-specific backend (that is
> > not ibverbs) that takes any
> >
> > special advantage of mellanox hw capabilities?
> 
[YH> ] 

Note that Accelio is hardware independent, works over different RDMA
transports (IB, RoCE, iWARP, ...), and will add non-RDMA transports.
It is entirely open source and contains contributions from multiple
vendors; see: https://github.com/accelio/accelio
A variety of code examples is available in:
https://github.com/accelio/accelio/tree/master/examples/usr

Many transport optimizations and advanced features are built into it,
but they are abstracted from the end user, which allows best performance
with rapid development.  See more details in:
http://www.accelio.org/wp-content/themes/pyramid_child/pdf/WP_Accelio_OpenSource_IO_Message_and_RPC_Acceleration_Library.pdf


> The actual situation is that xio is currently ibverbs specific, though there is
> interest with Mellanox and some partners in building a TCP transport for it.
> 
[YH> ] 

TCP will be added early next year; the transport abstraction/plug-in
mechanism is already implemented.  If someone wants to help with that,
we are open to it :)

> What is true is that xio makes very advanced use of ibverbs interfaces, lock
> free/wait-free allocators, rdtsc, but hides a lot of details from upper layers.
> The xio designers knew how to get the most from infiniband/ RDMA, and it
> shows.
> 
[YH> ] 

Note that Accelio is faster than using ibverbs directly, since it applies
many optimizations in the way it uses the API (e.g., amortizing HW calls,
avoiding locks, avoiding memory coherency and locality issues).  Today we
can get ~1.5M TP/s (Req+Rep) per thread, and many million TP/s with
multiple threads.

> Also, ibverbs is a first-class interface to iWARP and esp.
> ROCE hardware, as well as ib.  I've been doing most of my development on a
> tweaked version of the softiwarp ib provider, which amounts to a full RDMA
> simulator that runs on anything.  (Apparently it can run over TCP, but I just
> use it on one vm host.)
> 
> I haven't worked with cci, but just glancing at it, I'd say xio stacks up very well
> on ibverbs, but won't solve the TCP transport problem immediately.
> 
> >
> > Similarly, are there other projects or vendors that are looking at xio
> > at this point?
> 
> Mellanox partners are working with it mainly, I believe.
> 
[YH> ] 
Several open source projects will incorporate Accelio as middleware
(e.g. HDFS), many storage/database vendors are adopting it, and a variety
of end users plan to use the C or Java APIs.  (Java Accelio performs like
the C code, at 1.5M TP/s per thread, since all the transport is in
hardware and it doesn't involve context switches or locks.)

> > I've seen similar attempts to create this sort of library
> > (CCI comes to mind: https://github.com/CCI/cci).  Have these previous
> > attempts influenced the design of xio at all?
> 
> > >
> > > The approach I took in incorporating Accelio was to build on the key abstractions
> > > of Messenger, Connection, and Dispatcher, and Message, and build a corresponding
> > > family of concrete classes:
> >
> > This sounds like the right approach.  And we definitely want to clean up
> > the separation of the abstract interfaces (Message, Connection, Messenger)
> > from the implementations.  I'm happy to pull that stuff into the tree
> > quickly once the interfaces appear stable (although it looks like your
> > branch is based off lots of other linuxbox bits, so it probably isn't
> > important until this gets closer to ready).
> 
> Ok, cool.
> 
> >
> > Also, it would be great to build out the simple_* test endpoints as this
> > effort progresses; hopefully that can eventually form the basis of a test
> > suite for the messenger and can be expanded to include various stress
> > tests that don't require a full running cluster.
> 
> I agree.  I intend to have it at least running more Message types RSN.
> 
> >
> > > XioMessenger (concrete, implements Messenger, encapsulates xio endpoints, aggregates
> 
> Agreed, I respond to this point in more detail in my reply to Greg's message.
> 
> >
> > This worries me a bit; see Greg's questions.  There are several
> > request/reply patterns, but many (most?) of the message exchanges are
> > asymmetrical.  I wonder if the low-level request/reply model really maps
> > more closely the 'ack' stuff in SimpleMessenger (as it's about
> > deallocating the sender's memory and cleaning up rdma state).
> >
> > > A lot of low level details of the mapping from Message to Accelio
> > > messaging are currently in flux, but the basic idea is to re-use the
> > > current encode/decode primitives as far as possible, while eliding the
> > > acks, sequence # and tids, and timestamp behaviors of Pipe, or rather,
> > > replacing them with mappings to Accelio primitives.  I have some wrapper
> > > classes that help with this.  For the moment, the existing Ceph message
> > > headers and footers are still there, but are now encoded/decoded, rather
> > > than hand-marshalled.  This means that checksumming is probably mostly
> > > intact.  Message signatures are not implemented.
> > >
> > > What works.  The current prototype isn't integrated with the main server daemons
> > > (e.g., OSD) but experimental work on that is in progress.  I've created a pair of
> > > simple standalone client/server applications simple_server/simple_client and
> > > a matching xio_server/xio_client, that provide a minimal message dispatch loop with
> > > a new SimpleDispatcher class and some other helpers, as a way to work with both
> > > messengers side-by-side.  These are currently very primitive, but will probably
> > > do more things soon.  The current prototype sends messages over Accelio, but has some issue
> > > with replies, that should be fixed shortly.  It leaks lots of memory, etc.
> > >
> > > We've pushed a work-in-progress branch "xio-messenger" to our external github
> > > repository, for community review.  Find it here:
> > >
> > > https://github.com/linuxbox2/linuxbox-ceph
> >
> > Looking through this, it occurs to me that there are some other
> > foundational pieces that we'll need to get in place soon:
> >
> >  - The XioMessenger is a completely different wire protocol that needs
> > to be distinct from the legacy protocol.  Probably we can use the
> > entity_addr_t::type field for this.
> >  - We'll want the various *Map structures to allow multiple
> > entity_addr_t's per entity.  We already could use this to support both
> >
> > IPv4 and IPv6.  In the future, though, we'll probably want clusters
> > that can speak both the legacy TCP protocol (via SimpleMessenger or
> > some improved implementation) and the xio one (and whatever else we
> > dream up in the future).
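
The two points above can be sketched roughly as follows: an entity advertises several addresses, each tagged with a transport type, and a peer picks the first one it is able to speak. This is only an illustration under stated assumptions. `AddrType`, `Addr`, `EntityAddrs` and `pick_addr` are hypothetical stand-ins, not the actual entity_addr_t or *Map definitions, and the tag values are invented.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical transport tags; the real entity_addr_t::type values are not
// given in this thread.
enum class AddrType : std::uint32_t { LegacyTcp = 1, Xio = 2 };

// Simplified stand-in for entity_addr_t: a transport tag plus "host:port".
struct Addr {
  AddrType type;
  std::string hostport;
};

// Stand-in for a map entry that advertises several addresses for one entity.
struct EntityAddrs {
  std::vector<Addr> addrs;
};

// Scan `preferred` in priority order and return the first advertised address
// with a matching transport, or nullptr if the entity offers none of them.
const Addr* pick_addr(const EntityAddrs& e,
                      const std::vector<AddrType>& preferred) {
  for (AddrType want : preferred) {
    for (const Addr& a : e.addrs) {
      if (a.type == want) return &a;
    }
  }
  return nullptr;
}
```

A client that speaks both protocols would pass its preference order (say, xio first, then legacy TCP), while a TCP-only client would pass just the legacy tag and never see the xio address.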
> 
> Ack.
> 
> >
> > Also, as has been mentioned previously,
> >
> >  - We need to continue to migrate stuff over to the Connection-based
> > Messenger interface and off the original methods that take
> > entity_inst_t.
> > The sticky bit here is the peer-to-peer mode that is used inside the
> > OSD and MDS clusters: those need to handle racing connection attempts,
> > which either requires the internal entity name -> connection map to
> > resolve
> >
> > races (as we have now) or a new approach that pushes the
> > race-resolution up into the calling code (meh).  No need to address it
> > now, but eventually we'll need to tackle it before this can be used on
> > the osd back-side network.
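
For readers unfamiliar with the race being described, here is a minimal sketch of resolving it inside an entity name -> connection map. The tie-break rule (the attempt started by the lexicographically smaller name wins) is an assumption chosen for illustration, not what SimpleMessenger actually does, and `ConnectionMap` is a hypothetical class.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

enum class ConnState { Connecting, Open };

// Hypothetical per-daemon table: peer entity name -> connection state.
class ConnectionMap {
 public:
  explicit ConnectionMap(std::string local) : local_(std::move(local)) {}

  // Record that we started an outgoing connect to `peer`.
  void start_connect(const std::string& peer) {
    conns_.emplace(peer, ConnState::Connecting);
  }

  // Called when `peer` connects to us. Returns true if the incoming
  // connection should be accepted, false if it should be rejected.
  bool on_incoming(const std::string& peer) {
    assert(peer != local_);
    auto it = conns_.find(peer);
    if (it == conns_.end()) {  // no race: accept and remember it
      conns_.emplace(peer, ConnState::Open);
      return true;
    }
    if (it->second == ConnState::Open) return false;  // duplicate: reject
    // Race: both sides are connecting. Assumed rule: the attempt made by
    // the smaller name survives, so both peers reach the same verdict.
    if (local_ < peer) return false;  // our attempt wins
    it->second = ConnState::Open;     // their attempt replaces ours
    return true;
  }

 private:
  std::string local_;
  std::map<std::string, ConnState> conns_;
};
```

Because both peers compare the same pair of names, exactly one of the two racing connections is kept. Pushing the resolution up into the calling code would mean each caller has to carry an equivalent rule itself.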
> >
> > sage
> 
> --
> Matt Benjamin
> CohortFS, LLC.
> 206 South Fifth Ave. Suite 150
> Ann Arbor, MI  48104
> 
> http://cohortfs.com
> 
> tel.  734-761-4689
> fax.  734-769-8938
> cel.  734-216-5309


* Re: Ceph Messaging on Accelio (libxio) RDMA
  2013-12-12  1:13   ` Matt W. Benjamin
@ 2013-12-18 22:13     ` Gregory Farnum
  0 siblings, 0 replies; 17+ messages in thread
From: Gregory Farnum @ 2013-12-18 22:13 UTC (permalink / raw)
  To: Matt W. Benjamin; +Cc: ceph-devel, Sage Weil, Yaron Haviv, Eyal Salomon

(Sorry for the delay getting back on this.)

On Wed, Dec 11, 2013 at 5:13 PM, Matt W. Benjamin <matt@cohortfs.com> wrote:
> Hi Greg,
>
> I haven't fixed the decision to reify replies in the Messenger at this
> point, but it is what the current prototype code tries to do.
>
> The request/response model is more general than my language implied, and
> also is not the only one available.  However, it is the richest model in
> Accelio, and I'm currently exploring how best to exploit it.
>
> The most general model available sends one-way messages in both directions,
> and obviously looks the most like the current Messenger model.  Under the covers,
> both Accelio models are built on the same primitives.  The one-way model
> is not incompatible with zero-copy RDMA operations, though I believe it's
> at least trivially true that the one-way model uses only send/recv and
> read operations.  Behind the scenes, the underlying Accelio framework
> requires a more or less exchange of state between the endpoints to maintain
> a balance of RDMA resources in each direction, and to complete RDMA read
> and write transactions (which use registered memory at the sender/receiver,
> respectively).

This sounds more like the acks which the SimpleMessenger already does
(unless I'm misunderstanding what you mean), in that the recipient has
to tell the sender "I have received this message and you don't need to
buffer it any more". Surely we want to let them move on before we've
(for instance) written the submitted data to disk? Or is it also about
telling the sender when they can re-consume the memory in the
recipient's box?
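
To pin down what I mean by ack here, a rough sketch of the sender side: messages stay buffered until the peer acknowledges receipt, and an ack for sequence N releases everything up to N. `SentQueue` is a hypothetical class written for this mail, not the Pipe code, and it says nothing about durability on the receiver.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>

// Hypothetical sender-side buffer of messages that were sent but not yet
// acknowledged by the peer.
class SentQueue {
 public:
  // Buffer a message until the peer acks it; returns its sequence number.
  std::uint64_t send(std::string payload) {
    sent_.emplace_back(++last_seq_, std::move(payload));
    return last_seq_;
  }

  // The peer reports it has received everything up to and including `seq`,
  // so those buffers can be dropped (or their memory reused).
  void handle_ack(std::uint64_t seq) {
    while (!sent_.empty() && sent_.front().first <= seq) sent_.pop_front();
  }

  std::size_t unacked() const { return sent_.size(); }

 private:
  std::uint64_t last_seq_ = 0;
  std::deque<std::pair<std::uint64_t, std::string>> sent_;
};
```

The ack only means "received", so the sender can move on long before the data reaches disk; any "written" notification would be a separate, higher-level message.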

> This isn't of course something the Messenger consumer needs
> to be aware of, except precisely so as to permit the framework to know when
> the upper layer operations on registered memory have completed.
>
> As for the higher level semantics, the first thing to note is that all the
> Accelio primitives provide for delivery receipts, and one of my goals is to
> unify Message acks completely with receipts.  A second key point is that the
> primary property of the current reply hook is not its ability to reply, but
> its completion semantics, and these can be articulated on any of the Accelio
> models.  It's possible that's all that's desired.
>
> What I'm still exploring is whether the request/response model provides additional
> value to the caller that one-way would not.  The third available model
> would use xio response messages to deliver any message available at the
> responder, so perhaps permitting greater application utilization of the
> underlying resources in some circumstances.  I think a lot of this will be
> clearer as I connect the XioMessenger to Ceph callers.  As we've worked on the
> prototype we've found a number of places where we could tweak the Accelio APIs
> to good effect, and I think we'll find more places as we continue work.

I'm still a little fuzzy about how we'd even use a request-reply model
when there are two (or zero) replies to a given request.
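
For comparison, this is roughly how the zero/one/two reply cases fall out when everything is a one-way message correlated by a transaction id. It is a sketch only: `PendingOps` is hypothetical, and the "ack" and "commit" strings used in the example are illustrative labels for two replies to one request.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical table of in-flight operations keyed by transaction id (tid).
class PendingOps {
 public:
  // The handler returns true once the operation needs no further replies.
  using Handler = std::function<bool(const std::string&)>;

  std::uint64_t submit(Handler h) {
    ops_[++last_tid_] = std::move(h);
    return last_tid_;
  }

  // Deliver a one-way message carrying `tid`. Returns false if the tid is
  // unknown (already finished, cancelled, or never issued).
  bool deliver(std::uint64_t tid, const std::string& msg) {
    auto it = ops_.find(tid);
    if (it == ops_.end()) return false;
    if (it->second(msg)) ops_.erase(it);
    return true;
  }

  // An operation that will get no reply at all is dropped by the caller.
  void cancel(std::uint64_t tid) { ops_.erase(tid); }

  std::size_t outstanding() const { return ops_.size(); }

 private:
  std::uint64_t last_tid_ = 0;
  std::map<std::uint64_t, Handler> ops_;
};
```

Nothing here ties the number of replies to the number of requests, which is the part I don't see how to express if the transport insists on exactly one response per request.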
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: Ceph Messaging on Accelio (libxio) RDMA
  2013-12-12  1:33   ` Matt W. Benjamin
  2013-12-12 11:32     ` Yaron Haviv
@ 2014-01-06 15:55     ` Atchley, Scott
  2014-01-07 19:52       ` Yaron Haviv
  1 sibling, 1 reply; 17+ messages in thread
From: Atchley, Scott @ 2014-01-06 15:55 UTC (permalink / raw)
  To: Matt W. Benjamin; +Cc: Sage Weil, ceph-devel, Yaron Haviv, Eyal Salomon

On Dec 11, 2013, at 8:33 PM, Matt W. Benjamin <matt@cohortfs.com> wrote:

> HI Sage,
> 
> inline
> 
> ----- "Sage Weil" <sage@inktank.com> wrote:
> 
>> Hi Matt,
>> 
>> Thanks for posting this!  Some comments and questions below.
>> 
>> 
>> I was originally thinking that xio was going to be more
>> mellanox-specific, 
>> but it looks like it runs over multiple transports (even tcp!).  (I'm
>> sure 
>> I've been told this before but it apparently didn't sink in.)  Is
>> there 
>> also a mellanox-specific backend (that is not ibverbs) that takes any
>> 
>> special advantage of mellanox hw capabilities?
> 
> The actual situation is that xio is currently ibverbs specific, though
> there is interest with Mellanox and some partners in building a TCP
> transport for it.
> 
> What is true is that xio makes very advanced use of ibverbs interfaces,
> lock free/wait-free allocators, rdtsc, but hides a lot of details from
> upper layers.  The xio designers knew how to get the most from infiniband/
> RDMA, and it shows.
> 
> Also, ibverbs is a first-class interface to iWARP and esp.
> ROCE hardware, as well as ib.  I've been doing most of my development on
> a tweaked version of the softiwarp ib provider, which amounts to a full
> RDMA simulator that runs on anything.  (Apparently it can run over TCP,
> but I just use it on one vm host.)
> 
> I haven't worked with cci, but just glancing at it, I'd say xio stacks
> up very well on ibverbs, but won't solve the TCP transport problem
> immediately.

The efforts seem similar, but with slightly different goals.

With CCI, our goal is to provide a vendor-neutral and fabric-neutral, generic communication abstraction layer for any interconnect that we use. Each generation of large HPC machine seems to get a new interface. The various MPI implementations have their own network abstract layers (NAL) so that MPI users do not need to worry about the low-level network interface. MPI, however, is limited to jobs within a single machine and, typically, to within a single job on that single machine. A researcher wanting to develop alternative programming models or services that connect multiple jobs or extend off the compute system have a hard, if not impossible, time using MPI. We currently support Sockets (UDP and TCP), Verbs (IB and ROCE, but probably not iWarp because we use SRQs), Cray GNI (Gemini and Aries), and slightly out-of-date Cray Portals3 (SeaStar). We are working on adding shared memory as well as transparent routing between fabrics so that a compute node on one fabric can route to a storage system on another fabric.

I would imagine Mellanox's goal with XIO is to provide a simpler programming model that masks native Verbs and RDMACM for Verbs compatible fabrics (IB, RoCE, and possibly iWarp if they do not use SRQs). It adds an active message-like model as well as access to the underlying messaging layer. The addition of TCP and shared memory makes sense.

Both provide an event-driven model and include the ability to provide notification via traditional OS methods such as epoll() and others.

I am unclear if XIO provides for background progress or if the application must periodically call into XIO to ensure progress.




* RE: Ceph Messaging on Accelio (libxio) RDMA
  2014-01-06 15:55     ` Atchley, Scott
@ 2014-01-07 19:52       ` Yaron Haviv
  2014-01-07 20:16         ` Yehuda Sadeh
  2014-01-08 15:54         ` Atchley, Scott
  0 siblings, 2 replies; 17+ messages in thread
From: Yaron Haviv @ 2014-01-07 19:52 UTC (permalink / raw)
  To: Atchley, Scott, Matt W. Benjamin; +Cc: Sage Weil, ceph-devel, Eyal Salomon

Scott, See below 

> -----Original Message-----
> From: Atchley, Scott [mailto:atchleyes@ornl.gov]
> Sent: Monday, January 06, 2014 5:55 PM
> To: Matt W. Benjamin
> Cc: Sage Weil; ceph-devel; Yaron Haviv; Eyal Salomon
> Subject: Re: Ceph Messaging on Accelio (libxio) RDMA
> 
> On Dec 11, 2013, at 8:33 PM, Matt W. Benjamin <matt@cohortfs.com>
> wrote:
> > HI Sage,
> >
> > inline
> >
> > ----- "Sage Weil" <sage@inktank.com> wrote:
> >
> >> Hi Matt,
> >>
> >> Thanks for posting this!  Some comments and questions below.
> >>
> >>
> >> I was originally thinking that xio was going to be more
> >> mellanox-specific, but it looks like it runs over multiple transports
> >> (even tcp!).  (I'm sure I've been told this before but it apparently
> >> didn't sink in.)  Is there also a mellanox-specific backend (that is
> >> not ibverbs) that takes any
> >>
> >> special advantage of mellanox hw capabilities?
> >
> > The actual situation is that xio is currently ibverbs specific, though
> > there is interest with Mellanox and some partners in building a TCP
> > transport for it.
> >
> > What is true is that xio makes very advanced use of ibverbs
> > interfaces, lock free/wait-free allocators, rdtsc, but hides a lot of
> > details from upper layers.  The xio designers knew how to get the most
> > from infiniband/ RDMA, and it shows.
> >
> > Also, ibverbs is a first-class interface to iWARP and esp.
> > ROCE hardware, as well as ib.  I've been doing most of my development
> > on a tweaked version of the softiwarp ib provider, which amounts to a
> > full RDMA simulator that runs on anything.  (Apparently it can run
> > over TCP, but I just use it on one vm host.)
> >
> > I haven't worked with cci, but just glancing at it, I'd say xio stacks
> > up very well on ibverbs, but won't solve the TCP transport problem
> > immediately.
> 
> The efforts seem similar, but with slightly different goals.
> 
> With CCI, our goal is to provide a vendor-neutral and fabric-neutral, generic
> communication abstraction layer for any interconnect that we use. Each
> generation of large HPC machine seems to get a new interface. The various
> MPI implementations have their own network abstract layers (NAL) so that
> MPI users do not need to worry about the low-level network interface. MPI,
> however, is limited to jobs within a single machine and, typically, to within a
> single job on that single machine. A researcher wanting to develop
> alternative programming models or services that connect multiple jobs or
> extend off the compute system have a hard, if not impossible, time using
> MPI. We currently support Sockets (UDP and TCP), Verbs (IB and ROCE, but
> probably not iWarp because we use SRQs), Cray GNI (Gemini and Aries), and
> slightly out-of-date Cray Portals3 (SeaStar). We are working on adding shared
> memory as well as transparent routing between fabrics so that a compute
> node on one fabric can route to a storage system on another fabric.
>
[[YH]] 

Accelio is open source and vendor and transport neutral; it works over any RDMA device, and the TCP transport is in progress.
It has a different focus than CCI; CCI seems similar to our MXM library used for MPI/SHMEM/PGAS/..
Accelio, by contrast, is focused on enterprise messaging and RPC; its goal is to maximize performance in a noisy, event-driven environment
and to provide end-to-end transaction reliability, including dealing with extreme failure cases, task cancellation and retransmission, multipath, data integrity, ..
It has C/C++/Java bindings, with Python in progress.

We have noticed that most of our partners and open source efforts repeat the same mistakes, duplicate a lot of code, and end up with partial functionality.
So we decided to write a common layer that deals with all the new-age transport challenges and takes our experience into account.
You can read the details in: http://www.accelio.org/wp-content/themes/pyramid_child/pdf/WP_Accelio_OpenSource_IO_Message_and_RPC_Acceleration_Library.pdf

Accelio is now used by Tier1/2 storage and database vendors, and is integrated into a few open source Storage/DB/NoSQL/NewSQL projects.
One open example is Hadoop.
    
 
> I would imagine Mellanox's goal with XIO is to provide a simpler programming
> model that masks native Verbs and RDMACM for Verbs compatible fabrics
> (IB, RoCE, and possibly iWarp if they do not use SRQs). It adds an active
> message-like model as well as access to the underlying messaging layer. The
> addition of TCP and shared memory makes sense.
> 
[[YH]] 
As I mentioned, the goal is to provide the best fast/reliable messaging layer.
It provides many integrated services that are not part of the common Verbs API.

> Both provide an event-driven model and include the ability to provide
> notification via traditional OS methods such as epoll() and others.
> 
> I am unclear if XIO provides for background progress or if the application
> must periodically call into XIO to ensure progress.
>
[[YH]] 
Accelio supports both interrupts and polling; it works well with both and automatically toggles between them when needed to maximize performance and minimize CPU overhead.
That is a bit different from the MPI models, which make heavy use of polling, given that MPI alternates computation and communication intervals and the CPU can afford to busy-wait.
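
As a rough illustration of what toggling between the two modes can look like, here is a small sketch of a mode selector: poll while events keep arriving, and fall back to blocking on an interrupt after a run of empty polls. This is not Accelio's actual heuristic. `AdaptiveWaiter` and its threshold parameter are hypothetical.

```cpp
#include <cassert>

enum class WaitMode { Poll, Interrupt };

// Hypothetical selector that picks how the event loop should wait next.
class AdaptiveWaiter {
 public:
  // `max_empty_polls`: consecutive empty polls tolerated before giving up
  // busy-polling and arming the interrupt instead.
  explicit AdaptiveWaiter(int max_empty_polls) : max_empty_(max_empty_polls) {
    assert(max_empty_polls > 0);
  }

  // Report how many events the last wait returned; get the mode for the
  // next wait.
  WaitMode next(int events_seen) {
    if (events_seen > 0) {
      empty_ = 0;
      mode_ = WaitMode::Poll;  // traffic is flowing: polling has low latency
    } else if (mode_ == WaitMode::Poll && ++empty_ >= max_empty_) {
      mode_ = WaitMode::Interrupt;  // idle: stop burning CPU
    }
    return mode_;
  }

 private:
  int max_empty_;
  int empty_ = 0;
  WaitMode mode_ = WaitMode::Interrupt;  // start idle
};
```

An event loop would call `next()` after every wait and then either spin on the completion queue or block in something like epoll until the next notification.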
 
> >
> >>
> >> Similarly, are there other projects or vendors that are looking at
> >> xio at this point?
> >
> > Mellanox partners are working with it mainly, I believe.
> >
> >> I've seen similar attempts to create this sort of library
> >>
> >> (CCI comes to mind: https://github.com/CCI/cci).  Have these previous
> >>
> >> attempts influenced the design of xio at all?
> >
[[YH]] 

The major influence on Accelio's design came from storage and DB based RDMA protocols, and the lessons we learned from them (e.g. Accelio is lockless, unlike the current ones).
We obviously did quite a bit of review with our MPI and MXM experts.

We would be happy to share more details and examples with you.




* Re: Ceph Messaging on Accelio (libxio) RDMA
  2014-01-07 19:52       ` Yaron Haviv
@ 2014-01-07 20:16         ` Yehuda Sadeh
  2014-01-07 20:25           ` Yaron Haviv
  2014-01-08 15:54         ` Atchley, Scott
  1 sibling, 1 reply; 17+ messages in thread
From: Yehuda Sadeh @ 2014-01-07 20:16 UTC (permalink / raw)
  To: Yaron Haviv
  Cc: Atchley, Scott, Matt W. Benjamin, Sage Weil, ceph-devel,
	Eyal Salomon

On Tue, Jan 7, 2014 at 11:52 AM, Yaron Haviv <yaronh@mellanox.com> wrote:
> [[YH]]
>
> Accelio is open source and vendor and transport neutral; it works over any RDMA device, and the TCP transport is in progress.
> It has a different focus than CCI; CCI seems similar to our MXM library used for MPI/SHMEM/PGAS/..
> Accelio, by contrast, is focused on enterprise messaging and RPC; its goal is to maximize performance in a noisy, event-driven environment
> and to provide end-to-end transaction reliability, including dealing with extreme failure cases, task cancellation and retransmission, multipath, data integrity, ..
> It has C/C++/Java bindings, with Python in progress.

I don't know if it has been covered (wasn't following this thread
closely). The Accelio licensing is problematic. It's dual licensed
(GPLv2, modified BSD which requires retaining the copyright clause).
Both licenses are compatible with ceph; however, both will add
restrictions to the ceph distribution which we want to avoid.

Yehuda




* RE: Ceph Messaging on Accelio (libxio) RDMA
  2014-01-07 20:16         ` Yehuda Sadeh
@ 2014-01-07 20:25           ` Yaron Haviv
  2014-01-07 20:32             ` Matt W. Benjamin
  0 siblings, 1 reply; 17+ messages in thread
From: Yaron Haviv @ 2014-01-07 20:25 UTC (permalink / raw)
  To: Yehuda Sadeh
  Cc: Atchley, Scott, Matt W. Benjamin, Sage Weil, ceph-devel,
	Eyal Salomon

> -----Original Message-----
> From: Yehuda Sadeh [mailto:yehuda@inktank.com]
> Sent: Tuesday, January 07, 2014 10:17 PM
> To: Yaron Haviv
> Cc: Atchley, Scott; Matt W. Benjamin; Sage Weil; ceph-devel; Eyal Salomon
> Subject: Re: Ceph Messaging on Accelio (libxio) RDMA
> 
> On Tue, Jan 7, 2014 at 11:52 AM, Yaron Haviv <yaronh@mellanox.com>
> wrote:
> > [[YH]]
> >
> > Accelio is open source and vendor and transport neutral; it works over
> > any RDMA device, and the TCP transport is in progress.  It has a different
> > focus than CCI; CCI seems similar to our MXM library used for
> > MPI/SHMEM/PGAS/..
> > Accelio, by contrast, is focused on enterprise messaging and RPC; its goal
> > is to maximize performance in a noisy, event-driven environment and to
> > provide end-to-end transaction reliability, including dealing with extreme
> > failure cases, task cancellation and retransmission, multipath, data integrity, ..
> > It has C/C++/Java bindings, with Python in progress.
> 
> I don't know if it has been covered (wasn't following this thread closely). The
> Accelio licensing is problematic. It's dual licensed (GPLv2, modified BSD which
> requires retaining the copyright clause), both licenses are compatible with
> ceph, however, both will add restrictions to the ceph distribution which we
> want to avoid.
> 
> Yehuda
>
[[YH]] 
Yehuda,

Let's take it offline to understand the concerns; the license is identical to that of the OFED (RDMA Verbs) library, AFAIK.
If we need to make modifications to make it less restrictive, we are open to it.

Yaron
 


* Re: Ceph Messaging on Accelio (libxio) RDMA
  2014-01-07 20:25           ` Yaron Haviv
@ 2014-01-07 20:32             ` Matt W. Benjamin
  0 siblings, 0 replies; 17+ messages in thread
From: Matt W. Benjamin @ 2014-01-07 20:32 UTC (permalink / raw)
  To: Yaron Haviv
  Cc: Scott Atchley, Sage Weil, ceph-devel, Eyal Salomon, Yehuda Sadeh

Apparently the only issue is the clause 2 requirement to mention the
copyright in documentation or other materials?

Matt

----- "Yaron Haviv" <yaronh@mellanox.com> wrote:

> > -----Original Message-----
> > From: Yehuda Sadeh [mailto:yehuda@inktank.com]
> > Sent: Tuesday, January 07, 2014 10:17 PM
> > To: Yaron Haviv
> > Cc: Atchley, Scott; Matt W. Benjamin; Sage Weil; ceph-devel; Eyal
> Salomon
> > Subject: Re: Ceph Messaging on Accelio (libxio) RDMA
> > 
> > On Tue, Jan 7, 2014 at 11:52 AM, Yaron Haviv <yaronh@mellanox.com>
> > wrote:
> > > [[YH]]
> > >
> > > Accelio is an open source, vendor and transport neutral, works
> over
> > > any RDMA device, and the TCP transport is progress It has
> different focus
> > than CCI, CCI seems similar to our MXM library used for
> > MPI/SHMEM/PGAS/..
> > > While Accelio is Enterprise messaging and RPC focused, its goal is
> to
> > > maximize performance in a noisy event driven environment And have
> end
> > to end transaction reliability, including dealing with extreme
> failure cases,
> > task cancelation and retransmission, multipath, data-integrity ..
> > > It has C/C++/Java bindings, and Python in progress
> > 
> > I don't know if it has been covered (wasn't following this thread
> closely). The
> > Accelio licensing is problematic. It's dual licensed (GPLv2,
> modified BSD which
> > requires retaining the copyright clause), both licenses are
> compatible with
> > ceph, however, both will add restrictions to the ceph distribution
> which we
> > want to avoid.
> > 
> > Yehuda
> >
> [[YH]] 
> Yehuda,
> 
> Let's take it offline to understand the concerns, the license is
> identical to the OFED (RDMA Verbs) library AFAIK  
> If we need to make modifications to make it less restrictive we are
> open to it 
> 
> Yaron
>  
> > 
> > >
> > > We have noticed that most of our partners and OpenSource efforts
> > > repeat the same mistakes, duplicate a lot of code, and end up
> with
> > > partial functionality So we decided to write a common layer that
> deal
> > > with all the new age transport challenges, and taking into account
> our
> > > experience Can read the details in:
> > > http://www.accelio.org/wp-content/themes/pyramid_child/pdf/WP_Accelio_OpenSource_IO_Message_and_RPC_Acceleration_Library.pdf
> > >
> > > Accelio is now used by Tier1/2 storage and database vendors, and
> > > integrate into few OpenSource Storage/DB/NoSQL/NewSQL projects
> One
> > > open example is Hadoop
> > >

-- 
Matt Benjamin
CohortFS, LLC.
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://cohortfs.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 


* Re: Ceph Messaging on Accelio (libxio) RDMA
  2014-01-07 19:52       ` Yaron Haviv
  2014-01-07 20:16         ` Yehuda Sadeh
@ 2014-01-08 15:54         ` Atchley, Scott
  2014-01-08 17:31           ` Yaron Haviv
  1 sibling, 1 reply; 17+ messages in thread
From: Atchley, Scott @ 2014-01-08 15:54 UTC (permalink / raw)
  To: Yaron Haviv; +Cc: Matt W. Benjamin, Sage Weil, ceph-devel, Eyal Salomon

On Jan 7, 2014, at 2:52 PM, Yaron Haviv <yaronh@mellanox.com> wrote:

> Scott, See below
> 
>> -----Original Message-----
>> From: Atchley, Scott [mailto:atchleyes@ornl.gov]
>> Sent: Monday, January 06, 2014 5:55 PM
>> To: Matt W. Benjamin
>> Cc: Sage Weil; ceph-devel; Yaron Haviv; Eyal Salomon
>> Subject: Re: Ceph Messaging on Accelio (libxio) RDMA
>> 
>> On Dec 11, 2013, at 8:33 PM, Matt W. Benjamin <matt@cohortfs.com>
>> wrote:
>>> HI Sage,
>>> 
>>> inline
>>> 
>>> ----- "Sage Weil" <sage@inktank.com> wrote:
>>> 
>>>> Hi Matt,
>>>> 
>>>> Thanks for posting this!  Some comments and questions below.
>>>> 
>>>> 
>>>> I was originally thinking that xio was going to be more
>>>> mellanox-specific, but it looks like it runs over multiple transports
>>>> (even tcp!).  (I'm sure I've been told this before but it apparently
>>>> didn't sink in.)  Is there also a mellanox-specific backend (that is
>>>> not ibverbs) that takes any
>>>> 
>>>> special advantage of mellanox hw capabilities?
>>> 
>>> The actual situation is that xio is currently ibverbs specific, though
>>> there is interest with Mellanox and some partners in building a TCP
>>> transport for it.
>>> 
>>> What is true is that xio makes very advanced use of ibverbs
>>> interfaces, lock free/wait-free allocators, rdtsc, but hides a lot of
>>> details from upper layers.  The xio designers knew how to get the most
>>> from infiniband/ RDMA, and it shows.
>>> 
>>> Also, ibverbs is a first-class interface to iWARP and esp.
>>> ROCE hardware, as well as ib.  I've been doing most of my development
>>> on a tweaked version of the softiwarp ib provider, which amounts to a
>>> full RDMA simulator that runs on anything.  (Apparently it can run
>>> over TCP, but I just use it on one vm host.)
>>> 
>>> I haven't worked with cci, but just glancing at it, I'd say xio stacks
>>> up very well on ibverbs, but won't solve the TCP transport problem
>>> immediately.
>> 
>> The efforts seem similar, but with slightly different goals.
>> 
>> With CCI, our goal is to provide a vendor-neutral and fabric-neutral, generic
>> communication abstraction layer for any interconnect that we use. Each
>> generation of large HPC machine seems to get a new interface. The various
>> MPI implementations have their own network abstract layers (NAL) so that
>> MPI users do not need to worry about the low-level network interface. MPI,
>> however, is limited to jobs within a single machine and, typically, to within a
>> single job on that single machine. A researcher wanting to develop
>> alternative programming models or services that connect multiple jobs or
>> extend off the compute system have a hard, if not impossible, time using
>> MPI. We currently support Sockets (UDP and TCP), Verbs (IB and ROCE, but
>> probably not iWarp because we use SRQs), Cray GNI (Gemini and Aries), and
>> slightly out-of-date Cray Portals3 (SeaStar). We are working on adding shared
>> memory as well as transparent routing between fabrics so that a compute
>> node on one fabric can route to a storage system on another fabric.
>> 
> [[YH]]
> 
> Accelio is an open source, vendor and transport neutral, works over any RDMA device, and the TCP transport is progress
> It has different focus than CCI, CCI seems similar to our MXM library used for MPI/SHMEM/PGAS/..

I thought MXM was a tag matching interface more similar to PSM or MX, no?

> While Accelio is Enterprise messaging and RPC focused, its goal is to maximize performance in a noisy event driven environment
> And have end to end transaction reliability, including dealing with extreme failure cases, task cancelation and retransmission, multipath, data-integrity ..

Interesting.

> It has C/C++/Java bindings, and Python in progress
> 
> We have noticed that most of our partners and OpenSource efforts repeat the same mistakes, duplicate a lot of code, and end up with partial functionality
> So we decided to write a common layer that deal with all the new age transport challenges, and taking into account our experience
> Can read the details in: http://www.accelio.org/wp-content/themes/pyramid_child/pdf/WP_Accelio_OpenSource_IO_Message_and_RPC_Acceleration_Library.pdf

Nice overview. Like many white papers, it oversells what is available today (e.g. includes shmem and TCP). ;-)

The Send/Receive interface seems very much like CCI. I can see where building Request/Response on top is trivial. One big difference I can see is the XIO session/connection versus CCI's endpoint/connection. XIO allows polling per connection while CCI only allows polling per endpoint (by one or more threads).

XIO claims to provide multi-pathing. Does XIO allow for reliable, but out-of-order delivery of messages that can happen with multi-pathing? If not, how does it guarantee order? Buffering on the receiver?

> Accelio is now used by Tier1/2 storage and database vendors, and integrate into few OpenSource Storage/DB/NoSQL/NewSQL projects
> One open example is Hadoop
> 
> 
>> I would imagine Mellanox's goal with XIO is to provide a simpler programming
>> model that masks native Verbs and RDMACM for Verbs compatible fabrics
>> (IB, RoCE, and possibly iWarp if they do not use SRQs). It adds an active
>> message-like model as well as access to the underlying messaging layer. The
>> addition of TCP and shared memory makes sense.
>> 
> [[YH]]
> As I mentioned the goal is to provide the best fast/reliable messaging layer
> It provide many integrated services that are not part of the common Verbs API
> 
>> Both provide an event-driven model and include the ability to provide
>> notification via traditional OS methods such as epoll() and others.
>> 
>> I am unclear if XIO provides for background progress or if the application
>> must periodically call into XIO to ensure progress.
>> 
> [[YH]]
> Accelio support interrupts and polling, it works well with both and automatically toggles between those when needed to max performance and min CPU overhead
> It's a bit different than the MPI models which make heavy use of polling, given MPI does computation/communication intervals and CPU can allow itself to busy wait

If an application is using the soon-to-be-written TCP transport, does XIO progress the server side automatically, or does the application have to call into XIO to ensure progress? I imagine that with Verbs-supported hardware the answer is that it is automatic. I would expect that with TCP the application must call into XIO, or that XIO uses a background thread for TCP. Or is the design still being hashed out?

Same question for shared memory.

>>>> Similarly, are there other projects or vendors that are looking at
>>>> xio at this point?
>>> 
>>> Mellanox partners are working with it mainly, I believe.
>>> 
>>>> I've seen similar attempts to create this sort of library
>>>> 
>>>> (CCI comes to mind: https://github.com/CCI/cci).  Have these previous
>>>> 
>>>> attempts influenced the design of xio at all?
>>> 
> [[YH]]
> 
> The major influence on Accelio design came from storage and DB based RDMA protocols, and lessons we learned (e.g. Accelio is lockless unlike the current ones)
> we obviously did quite a bit of review with our MPI and MXM experts
> 
> we would be happy to share with you more details and examples

Absolutely. I would appreciate more information and we can take it offline. I am interested in knowing what is needed to add a transport to XIO. I need more details about internal resource usage and scaling issues.


* RE: Ceph Messaging on Accelio (libxio) RDMA
  2014-01-08 15:54         ` Atchley, Scott
@ 2014-01-08 17:31           ` Yaron Haviv
  0 siblings, 0 replies; 17+ messages in thread
From: Yaron Haviv @ 2014-01-08 17:31 UTC (permalink / raw)
  To: Atchley, Scott; +Cc: Matt W. Benjamin, Sage Weil, ceph-devel, Eyal Salomon

> -----Original Message-----
> From: Atchley, Scott [mailto:atchleyes@ornl.gov]
> Sent: Wednesday, January 08, 2014 5:54 PM
> To: Yaron Haviv
> Cc: Matt W. Benjamin; Sage Weil; ceph-devel; Eyal Salomon
> Subject: Re: Ceph Messaging on Accelio (libxio) RDMA
> 
> On Jan 7, 2014, at 2:52 PM, Yaron Haviv <yaronh@mellanox.com> wrote:
> 
> > Scott, See below
> >
> >> -----Original Message-----
> >> From: Atchley, Scott [mailto:atchleyes@ornl.gov]
> >> Sent: Monday, January 06, 2014 5:55 PM
> >> To: Matt W. Benjamin
> >> Cc: Sage Weil; ceph-devel; Yaron Haviv; Eyal Salomon
> >> Subject: Re: Ceph Messaging on Accelio (libxio) RDMA
> >>
> >> On Dec 11, 2013, at 8:33 PM, Matt W. Benjamin <matt@cohortfs.com>
> >> wrote:
> >>> HI Sage,
> >>>
> >>> inline
> >>>
> >>> ----- "Sage Weil" <sage@inktank.com> wrote:
> >>>
> >>>> Hi Matt,
> >>>>
> >>>> Thanks for posting this!  Some comments and questions below.
> >>>>
> >>>>
> >>>> I was originally thinking that xio was going to be more
> >>>> mellanox-specific, but it looks like it runs over multiple
> >>>> transports (even tcp!).  (I'm sure I've been told this before but
> >>>> it apparently didn't sink in.)  Is there also a mellanox-specific
> >>>> backend (that is not ibverbs) that takes any
> >>>>
> >>>> special advantage of mellanox hw capabilities?
> >>>
> >>> The actual situation is that xio is currently ibverbs specific,
> >>> though there is interest with Mellanox and some partners in building
> >>> a TCP transport for it.
> >>>
> >>> What is true is that xio makes very advanced use of ibverbs
> >>> interfaces, lock free/wait-free allocators, rdtsc, but hides a lot
> >>> of details from upper layers.  The xio designers knew how to get the
> >>> most from infiniband/ RDMA, and it shows.
> >>>
> >>> Also, ibverbs is a first-class interface to iWARP and esp.
> >>> ROCE hardware, as well as ib.  I've been doing most of my
> >>> development on a tweaked version of the softiwarp ib provider, which
> >>> amounts to a full RDMA simulator that runs on anything.  (Apparently
> >>> it can run over TCP, but I just use it on one vm host.)
> >>>
> >>> I haven't worked with cci, but just glancing at it, I'd say xio
> >>> stacks up very well on ibverbs, but won't solve the TCP transport
> >>> problem immediately.
> >>
> >> The efforts seem similar, but with slightly different goals.
> >>
> >> With CCI, our goal is to provide a vendor-neutral and fabric-neutral,
> >> generic communication abstraction layer for any interconnect that we
> >> use. Each generation of large HPC machine seems to get a new
> >> interface. The various MPI implementations have their own network
> >> abstract layers (NAL) so that MPI users do not need to worry about
> >> the low-level network interface. MPI, however, is limited to jobs
> >> within a single machine and, typically, to within a single job on
> >> that single machine. A researcher wanting to develop alternative
> >> programming models or services that connect multiple jobs or extend
> >> off the compute system have a hard, if not impossible, time using
> >> MPI. We currently support Sockets (UDP and TCP), Verbs (IB and ROCE,
> >> but probably not iWarp because we use SRQs), Cray GNI (Gemini and
> >> Aries), and slightly out-of-date Cray Portals3 (SeaStar). We are
> >> working on adding shared memory as well as transparent routing
> >> between fabrics so that a compute node on one fabric can route to a
> >> storage system on another fabric.
> >>
> > [[YH]]
> >
> > Accelio is an open source, vendor and transport neutral, works over
> > any RDMA device, and the TCP transport is progress
> > It has different focus than CCI, CCI seems similar to our MXM library
> > used for MPI/SHMEM/PGAS/..
> 
> I thought MXM was a tag matching interface more similar to PSM or MX, no?
> 
[YH] 
MXM stands for Mellanox Messaging Service, and is integrated into things like OpenMPI and OFED
 
> > While Accelio is Enterprise messaging and RPC focused, its goal is to
> > maximize performance in a noisy event driven environment
> > And have end to end transaction reliability, including dealing with
> > extreme failure cases, task cancelation and retransmission, multipath,
> > data-integrity ..
> 
> Interesting.
> 
> > It has C/C++/Java bindings, and Python in progress
> >
> > We have noticed that most of our partners and OpenSource efforts
> > repeat the same mistakes, duplicate a lot of code, and end up with
> > partial functionality
> > So we decided to write a common layer that deal with all the new age
> > transport challenges, and taking into account our experience
> > Can read the details in:
> > http://www.accelio.org/wp-content/themes/pyramid_child/pdf/WP_Accelio_OpenSource_IO_Message_and_RPC_Acceleration_Library.pdf
> 
> Nice overview. Like many white papers, it oversells what is available today
> (e.g. includes shmem and TCP). ;-)
> 
[YH]
The paper is closely aligned with the current functionality, with the exception of the extra transports.
TCP is already under development and I hope it will be uploaded in a few weeks; for more functionality we are seeking help from the community, like any other open source project.
You can download the code and try out the examples. It's pretty stable (it runs a daily regression and is used by a bunch of vendors), and V1 GA is 3 weeks from now.
E.g. with the R-AIO file I/O example and the standard fio benchmark you will get 2-2.5M IOPS and <10us access time to a remote /dev/ram (i.e. 10x faster than anything else I know).
Interestingly, the Java version gets 99% of the C performance, i.e. millions of TP/s; the main reasons are that the CPU cores don't wander around or lock, and that the transport is in HW.

> The Send/Receive interface seems very much like CCI. I can see where
> building Request/Response on top is trivial. 
[YH] 
Not sure reliable and async Req/Rep, with all the associated task management and races, is trivial :)
We spent quite a bit of time on that; older transports like ZMQ still haven't gotten there (e.g. no async Rep, no multipath handling, ..)

> One big difference I can see if
> the XIO session/connection versus CCI's endpoint/connection. XIO allows
> polling per connection while CCI only allows polling per endpoint (by one or
> more threads).
> 
[YH] 
Accelio is lockless and allocates dedicated resources per CPU thread (context); even memory allocations are done smartly, from the nearest NUMA banks, to avoid coherency overhead.
By default you don't need to poll and batch: you get callbacks with optional hints, and we decide the strategy automatically based on various aspects.
You can poll explicitly, either per I/O (e.g. if you want to wait for the result, just pass the time as an argument in the call) or per context/event loop (e.g. tell the context that it can poll for x us before arming the interrupts, to avoid hysteresis).
Polling is done per context/thread (not per connection) and aggregates all the connections in that thread; we are working on an extension to poll on other resources in the same context as well (e.g. libaio for disk).
It has an automatic mechanism to avoid starvation and serialization issues, and it amortizes OS/HW calls whenever possible.
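
The "poll for x us before arming the interrupts" idea can be sketched roughly as below. This is only an illustration in Python, not Accelio code: `wait_for_event`, `poll_us` and `block_timeout_s` are made-up names, a `queue.Queue` stands in for the per-context completion source, and the blocking `get()` stands in for sleeping on an armed interrupt.

```python
import queue
import time

def wait_for_event(events, poll_us, block_timeout_s=1.0):
    """Busy-poll `events` for up to `poll_us` microseconds, then block.

    Returns (event, mode), where mode is "polled" if the event was picked
    up while spinning and "blocked" if we had to fall back to sleeping.
    event is None if nothing arrived before the blocking timeout.
    """
    deadline = time.monotonic() + poll_us / 1e6
    while True:
        try:
            # Cheap non-blocking check; stands in for polling a completion queue.
            return events.get_nowait(), "polled"
        except queue.Empty:
            if time.monotonic() >= deadline:
                break  # polling budget used up
    try:
        # Stands in for arming the interrupt and sleeping until it fires.
        return events.get(timeout=block_timeout_s), "blocked"
    except queue.Empty:
        return None, "blocked"

# One event already pending: it is picked up during the polling phase.
ctx_events = queue.Queue()
ctx_events.put("completion-1")
first = wait_for_event(ctx_events, poll_us=50)
```

A larger polling budget trades CPU time for lower latency under load; a budget of zero falls back to sleeping after a single check.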
  
> XIO claims to provide multi-pathing. Does XIO allow for reliable, but out-of-
> order delivery of messages that can happen with multi-pathing? If not, how
> does it guarantee order? Buffering on the receiver?
> 
[YH] 
Accelio tags each message with a seq/tid number, and the server side re-orders the messages (it just moves pointers, no copy).
In case of failures Accelio will re-send from the last accepted ID; the client frees a memory buffer only when the response arrives (or the Ack, in case of sends).
Accelio supports resource load-balancing and session/connection redirect (e.g. like iSCSI redirect), which allows distributing client sessions across multiple ports, threads, and local or cluster end-points, and it's transparent to the client operation.
For a single connection we have an initial version of Active/Passive; Active/Active will be addressed in the next version.
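
As a rough illustration of the sequence-number scheme described above, here is a minimal Python sketch. It is not taken from Accelio; `ReorderBuffer`, `Sender` and their methods are hypothetical names, and the receiver only stores references to messages (the analogue of moving pointers rather than copying).

```python
class ReorderBuffer:
    """Receiver side: deliver messages in sequence order."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq  # next sequence number to deliver
        self.pending = {}          # seq -> message reference, held until in order

    def receive(self, seq, msg):
        """Accept one message; return the messages that are now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []              # duplicate caused by a retransmission
        self.pending[seq] = msg
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready

    def last_accepted(self):
        """Highest sequence number delivered in order so far."""
        return self.next_seq - 1


class Sender:
    """Client side: keep each buffer until its response (or ack) arrives."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = {}          # seq -> message still owned by the sender

    def send(self, msg):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = msg
        return seq

    def on_response(self, seq):
        self.unacked.pop(seq, None)  # buffer can be freed now

    def resend_after(self, last_accepted):
        """Messages to retransmit after a failure, oldest first."""
        return [(s, m) for s, m in sorted(self.unacked.items())
                if s > last_accepted]
```

With two paths, message 1 may overtake message 0; the buffer holds it until 0 arrives and then releases both in order.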

> > Accelio is now used by Tier1/2 storage and database vendors, and
> > integrate into few OpenSource Storage/DB/NoSQL/NewSQL projects One
> > open example is Hadoop
> >
> >
> >> I would imagine Mellanox's goal with XIO is to provide a simpler
> >> programming model that masks native Verbs and RDMACM for Verbs
> >> compatible fabrics (IB, RoCE, and possibly iWarp if they do not use
> >> SRQs). It adds an active message-like model as well as access to the
> >> underlying messaging layer. The addition of TCP and shared memory
> >> makes sense.
> >>
> > [[YH]]
> > As I mentioned the goal is to provide the best fast/reliable messaging layer
> > It provide many integrated services that are not part of the common Verbs API
> >
> >> Both provide an event-driven model and include the ability to provide
> >> notification via traditional OS methods such as epoll() and others.
> >>
> >> I am unclear if XIO provides for background progress or if the
> >> application must periodically call into XIO to ensure progress.
> >>
> > [[YH]]
> > Accelio support interrupts and polling, it works well with both and
> > automatically toggles between those when needed to max performance
> > and min CPU overhead
> > It's a bit different than the MPI models which make heavy use of
> > polling, given MPI does computation/communication intervals and CPU
> > can allow itself to busy wait
> 
> If an application is using the soon-to-be written TCP transport, does XIO
> progress the server side automatically or does the application have to call
> into XIO calls to ensure progress? I imagine when using Verbs-supported
> hardware, the answer is that it is automatic. I would expect that with TCP
> that the application must call into XIO or that XIO uses a background thread
> for TCP. Or is the design still being hashed out?
> 
> Same question for shared memory.
> 
[YH] 
The application doesn't change if you change the transport.
It just registers a callback, which gets notified once a req or rep message has arrived and/or been accepted by the peer (we have an optional barrier/ack).
The app model is simple: send a bunch of requests and get called back when the answer arrives; the callback provides all the message details/context so you can process it asynchronously.
It's explained in more detail in the WP (and it all works in reality :) )
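
The application model above (register a callback, queue several requests, get called back per answer) might look something like this toy Python sketch. None of these names come from the Accelio API; `LoopbackSession` is a hypothetical in-process stand-in for a session plus its transport, with the "server" reduced to a local handler function.

```python
from collections import deque

class LoopbackSession:
    """Toy session: requests are queued and answered by a local handler."""

    def __init__(self, on_response, handler):
        self.on_response = on_response  # application callback
        self.handler = handler          # stands in for the remote server
        self.in_flight = deque()

    def send_request(self, request, user_context=None):
        # Returns immediately; the answer is delivered later via the callback.
        self.in_flight.append((request, user_context))

    def run_event_loop(self):
        # Deliver each response together with the context the caller supplied.
        while self.in_flight:
            request, user_context = self.in_flight.popleft()
            self.on_response(request, self.handler(request), user_context)


results = []

def on_response(request, response, user_context):
    # The callback gets everything it needs to process the answer on its own.
    results.append((user_context, response))

session = LoopbackSession(on_response, handler=str.upper)
for i, word in enumerate(["ping", "pong"]):
    session.send_request(word, user_context=i)
session.run_event_loop()
```

Because the application only sees `send_request` and its callback, swapping the loopback for a different transport would not change the application code, which is the point made above.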

> >>>> Similarly, are there other projects or vendors that are looking at
> >>>> xio at this point?
> >>>
> >>> Mellanox partners are working with it mainly, I believe.
> >>>
> >>>> I've seen similar attempts to create this sort of library
> >>>>
> >>>> (CCI comes to mind: https://github.com/CCI/cci).  Have these
> >>>> previous
> >>>>
> >>>> attempts influenced the design of xio at all?
> >>>
> > [[YH]]
> >
> > The major influence on Accelio design came from storage and DB based
> > RDMA protocols, and lessons we learned (e.g. Accelio is lockless
> > unlike the current ones)
> > we obviously did quite a bit of review with our MPI and MXM experts
> >
> > we would be happy to share with you more details and examples
> 
> Absolutely. I would appreciate more information and we can take it offline. I
> am interested in knowing what is needed to add a transport to XIO. I need
> more details about internal resource usage and scaling issues.
[YH] 
Sure, we can arrange a call and explain the details; we would be happy to see people write more transports and add features.
You can start with the transport h file.
E.g. one of our partners is planning to do a PCIe transport, and we have discussions with some KVM gurus about doing a dedicated Virtio transport to max perf in para-virtualized mode, ..

Yaron


end of thread, other threads:[~2014-01-08 17:31 UTC | newest]

Thread overview: 17+ messages
2013-12-11 22:32 Ceph Messaging on Accelio (libxio) RDMA Matt W. Benjamin
2013-12-11 22:58 ` Gregory Farnum
2013-12-12  1:13   ` Matt W. Benjamin
2013-12-18 22:13     ` Gregory Farnum
2013-12-12  0:59 ` Sage Weil
2013-12-12  1:33   ` Matt W. Benjamin
2013-12-12 11:32     ` Yaron Haviv
2014-01-06 15:55     ` Atchley, Scott
2014-01-07 19:52       ` Yaron Haviv
2014-01-07 20:16         ` Yehuda Sadeh
2014-01-07 20:25           ` Yaron Haviv
2014-01-07 20:32             ` Matt W. Benjamin
2014-01-08 15:54         ` Atchley, Scott
2014-01-08 17:31           ` Yaron Haviv
2013-12-12  2:14   ` Mark Nelson
2013-12-12 10:19 ` Kasper Dieter
2013-12-12 10:43   ` Yaron Haviv
