From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nivedita Singhvi
Subject: Re: Re: Interdomain comms
Date: Fri, 06 May 2005 09:57:02 -0700
Message-ID: <427BA1DE.1030805@us.ibm.com>
References: <0BAE938A1E68534E928747B9B46A759A6CF3AC@EXCNYSM0A1AH.nysemail.nyenet> <1115325448.12082.79.camel@localhost> <427B20B9.1010101@hp.com> <1115381693.18929.159.camel@localhost>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
In-Reply-To: <1115381693.18929.159.camel@localhost>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Harry Butterworth
Cc: Mike Wray, xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Harry Butterworth wrote:
> On Fri, 2005-05-06 at 08:46 +0100, Mike Wray wrote:

Harry, thanks for bringing this to xen-devel discussion.

> The inter-domain communication API should preserve the efficiency of
> these primitives but provide a higher level API which is more convenient
> to use.

We certainly need to simplify the API for the frontends, making
it easier to add frontends for new devices and OSs. We also need
to build in support for frontend-to-frontend communication in an
efficient way.

> So, I think we're looking for a higher-level API which can preserve the
> current efficient implementation for domains resident on the same
> physical machine but allows for domains to be separated by a network
> interface without having to rewrite all the drivers.
>
> The API needs to address the following issues:
>
> Resource discovery --- Discovering the targets of IDC is an inherent
> requirement.
>
> Dynamic behaviour --- Domains are going to come and go all the time.
>
> Stale communications --- When domains come and go, client protocols must
> have a way to recover from communications in flight or potentially in
> flight from before the last transition.
>
> Deadlock --- IDC is a shared resource and must not introduce resource
> deadlock issues, for example when FE and BEs are arranged symmetrically
> in reverse across the same interface or when BEs are stacked and so
> introduce chains of dependencies.
>
> Security --- There are varying degrees of trust between the domains.
>
> Ease of use --- This is important for developer productivity and also to
> help ensure the other goals (security/robustness) are actually met.
>
> Efficiency/Performance --- obviously.
>
> I'd need a few days (which I don't have right now) to put together a
> coherent proposal tailored specifically to xen. However, it would
> probably be along the lines of the following:
>
> A buffer abstraction to decouple the IDC API from the memory management
> implementation:
>
> struct local_buffer_reference;
>
> An endpoint abstraction to represent one end of an IDC connection. It's
> important that this is done on a per connection basis rather than having
> one per domain for all IDC activity because it avoids deadlock issues
> arising from chained, dependent communication.
>
> struct idc_endpoint;
>
> A message abstraction because some protocols are more efficiently
> implemented using one-way messages than request-response pairs,
> particularly when the protocol involves more than two parties.
>
> struct idc_message
> {
>     ...
>     struct local_buffer_reference message_body;
> };
>
> /* When a received message is finished with */
>
> void idc_message_complete( struct idc_message * message );
>
> A request-response transaction abstraction because most protocols are
> more easily implemented with these.
>
> struct idc_transaction
> {
>     ...
>     struct local_buffer_reference transaction_parameters;
>     struct local_buffer_reference transaction_status;
> };
>
> /* Useful to have an error code in addition to status. */
>
> /* When a received transaction is finished with.
> */
>
> void idc_transaction_complete
>   ( struct idc_transaction * transaction, error_code error );
>
> /* When an initiated transaction completes. Error code also reports
>    transport errors when endpoint disconnects whilst transaction is
>    outstanding. */
>
> error_code idc_transaction_query_error_code
>   ( struct idc_transaction * transaction );
>
> An IDC address abstraction:
>
> struct idc_address;
>
> A mechanism to initiate connection establishment, can't fail because
> endpoint resource is pre-allocated and create doesn't actually need to
> establish the connection.
>
> The endpoint calls the registered notification functions as follows:
>
> 'appear' when the remote endpoint is discovered then 'disappear' if it
> goes away again or 'connect' if a connection is actually established.
>
> After 'connect', the client can submit messages and transactions.
>
> 'disconnect' when the connection is failing, the client must wait for
> outstanding messages and transactions to complete (successfully or with a
> transport error) before completing the disconnect callback and must
> flush received messages and transactions whilst disconnected.
>
> Then 'connect' if the connection is reestablished or 'disappear' if the
> remote endpoint has gone away.
>
> A disconnect, connect cycle guarantees that the remote endpoint also
> goes through a disconnect, connect cycle.
>
> This API allows multi-pathing clients to make intelligent decisions and
> provides sufficient guarantees about stale messages and transactions to
> make a useful foundation.
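
To make the notification ordering above concrete, here is a minimal
sketch in C of the sequence as described: appear, then connect or
disappear; disconnect only after connect; then connect again or
disappear. The enum names, idc_state_next and idc_may_submit are made
up for illustration and are not part of the proposed API; a real
endpoint would also have to track outstanding messages and
transactions before completing the disconnect callback.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout: none of this is part of the proposed API. */
enum idc_state { IDC_ABSENT, IDC_VISIBLE, IDC_CONNECTED, IDC_DISCONNECTED };
enum idc_event { IDC_APPEAR, IDC_CONNECT, IDC_DISCONNECT, IDC_DISAPPEAR };

/* Apply one notification; returns false (state unchanged) if it is out of order. */
static bool idc_state_next(enum idc_state *state, enum idc_event event)
{
    switch (*state) {
    case IDC_ABSENT:
        if (event == IDC_APPEAR) { *state = IDC_VISIBLE; return true; }
        break;
    case IDC_VISIBLE:
    case IDC_DISCONNECTED:
        /* Both states may go on to connect or see the peer vanish. */
        if (event == IDC_CONNECT) { *state = IDC_CONNECTED; return true; }
        if (event == IDC_DISAPPEAR) { *state = IDC_ABSENT; return true; }
        break;
    case IDC_CONNECTED:
        if (event == IDC_DISCONNECT) { *state = IDC_DISCONNECTED; return true; }
        break;
    }
    return false;
}

/* Messages and transactions may be submitted only while connected. */
static bool idc_may_submit(enum idc_state state)
{
    return state == IDC_CONNECTED;
}

/* Walk one lifecycle that includes a disconnect, connect cycle. */
void idc_lifecycle_demo(void)
{
    enum idc_state s = IDC_ABSENT;
    bool ok;

    ok = idc_state_next(&s, IDC_APPEAR);     assert(ok);
    ok = idc_state_next(&s, IDC_CONNECT);    assert(ok && idc_may_submit(s));
    ok = idc_state_next(&s, IDC_DISCONNECT); assert(ok && !idc_may_submit(s));
    ok = idc_state_next(&s, IDC_CONNECT);    assert(ok && idc_may_submit(s));
    ok = idc_state_next(&s, IDC_DISCONNECT); assert(ok);
    ok = idc_state_next(&s, IDC_DISAPPEAR);  assert(ok && s == IDC_ABSENT);
}
```

A client could run each callback through a check like this in debug
builds to catch an endpoint implementation that delivers
notifications out of order.
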
>
> void idc_endpoint_create
> (
>     struct idc_endpoint * endpoint,
>     struct idc_address address,
>     void ( * appear )( struct idc_endpoint * endpoint ),
>     void ( * connect )( struct idc_endpoint * endpoint ),
>     void ( * disconnect )
>       ( struct idc_endpoint * endpoint, struct callback * callback ),
>     void ( * disappear )( struct idc_endpoint * endpoint ),
>     void ( * handle_message )
>       ( struct idc_endpoint * endpoint, struct idc_message * message ),
>     void ( * handle_transaction )
>       (
>         struct idc_endpoint * endpoint,
>         struct idc_transaction * transaction
>       )
> );
>
> void idc_endpoint_submit_message
>   ( struct idc_endpoint * endpoint, struct idc_message * message );
>
> void idc_endpoint_submit_transaction
>   ( struct idc_endpoint * endpoint, struct idc_transaction * transaction );
>
> idc_endpoint_destroy completes the callback once the endpoint has
> 'disconnected' and 'disappeared' and the endpoint resource is free for
> reuse for a different connection.
>
> void idc_endpoint_destroy
> (
>     struct idc_endpoint * endpoint,
>     struct callback * callback
> );
>
> The messages and transaction parameters and status must be of finite
> length (these quota properties might be parameters of the endpoint
> resource allocation). Need a mechanism for efficient, arbitrary length
> bulk transfer too.
>
> An abstraction for buffers owned by remote domains:
>
> struct remote_buffer_reference;
>
> Can register a local buffer with the IDC to get a remote buffer
> reference:
>
> struct remote_buffer_reference idc_register_buffer
>   ( struct local_buffer_reference buffer, some kind of resource probably
>     required here );
>
> remote buffer references may be passed between domains in idc messages
> or transaction parameters or transaction status.
>
> remote buffer references may be forwarded between domains and are usable
> from any domain.
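
One way a reference could be forwarded freely is for it to be plain
data that only the owning domain can resolve. The sketch below is an
assumption about how that might look, not something stated in the
proposal: the field layout, the sketch_* names and the fixed-size
table are all hypothetical, and a single table stands in for one
owner domain. A generation counter per slot makes old copies of a
reference stop resolving once the owner withdraws the registration,
even if the slot is reused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the proposal leaves struct remote_buffer_reference opaque. */
struct remote_buffer_reference {
    uint32_t owner_domain; /* domain that registered the buffer */
    uint32_t slot;         /* index into the owner's registration table */
    uint32_t generation;   /* must match the slot's current generation */
};

/* One owner-side table entry; all names here are illustrative. */
struct sketch_registration {
    void    *base;
    size_t   length;
    uint32_t generation;
    bool     in_use;
};

#define SKETCH_SLOTS 8
static struct sketch_registration sketch_table[SKETCH_SLOTS];

/* Find the live registration behind a reference, or NULL if stale or unknown. */
static struct sketch_registration *sketch_lookup(const struct remote_buffer_reference *ref)
{
    struct sketch_registration *r;

    if (ref->slot >= SKETCH_SLOTS)
        return NULL;
    r = &sketch_table[ref->slot];
    return (r->in_use && r->generation == ref->generation) ? r : NULL;
}

/* Register a local buffer; returns false if no slot is free. */
static bool sketch_register(uint32_t domain, void *base, size_t length,
                            struct remote_buffer_reference *out)
{
    for (uint32_t i = 0; i < SKETCH_SLOTS; i++) {
        struct sketch_registration *r = &sketch_table[i];
        if (!r->in_use) {
            r->base = base;
            r->length = length;
            r->in_use = true;
            out->owner_domain = domain;
            out->slot = i;
            out->generation = r->generation;
            return true;
        }
    }
    return false;
}

/* Owner-side resolve; returns NULL when the reference no longer resolves. */
static void *sketch_resolve(const struct remote_buffer_reference *ref, size_t *length)
{
    struct sketch_registration *r = sketch_lookup(ref);

    if (r == NULL)
        return NULL;
    *length = r->length;
    return r->base;
}

/* Withdraw a registration; bumping the generation invalidates every copy. */
static void sketch_unregister(const struct remote_buffer_reference *ref)
{
    struct sketch_registration *r = sketch_lookup(ref);

    if (r != NULL) {
        r->in_use = false;
        r->generation++;
    }
}

/* A forwarded copy resolves until the owner unregisters, then fails. */
void sketch_buffer_demo(void)
{
    static char buffer[64];
    struct remote_buffer_reference ref, forwarded;
    size_t length = 0;
    bool ok = sketch_register(1, buffer, sizeof buffer, &ref);

    assert(ok);
    forwarded = ref; /* plain copy, as if carried in a message body */
    assert(sketch_resolve(&forwarded, &length) == (void *)buffer);
    sketch_unregister(&ref);
    assert(sketch_resolve(&forwarded, &length) == NULL);
}
```

The sketch says nothing about how a transfer request is routed to
the owner, or about the trust issues in accepting a reference from
another domain; both would need real design work.
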
>
> Once in possession of a remote buffer reference, a domain can transfer
> data between the remote buffer and a local buffer:
>
> void idc_send_to_remote_buffer
> (
>     struct remote_buffer_reference remote_buffer,
>     struct local_buffer_reference local_buffer,
>     struct callback * callback, /* transfer completes asynchronously */
>     some kind of resource required here
> );
>
> void idc_receive_from_remote_buffer
> (
>     struct remote_buffer_reference remote_buffer,
>     struct local_buffer_reference local_buffer,
>     struct callback * callback, /* Again, completes asynchronously */
>     some kind of resource required here
> );
>
> Can unregister to free a local buffer independent of remote buffer
> references still knocking around in remote domains (subsequent
> sends/receives fail):
>
> void idc_unregister_buffer
>   ( probably a pointer to the resource passed on registration );
>
> So, the 1000 statements of establishment code in the current drivers
> become:
>
> Receive an idc address from somewhere (resource discovery is outside the
> scope of this sketch).
>
> Allocate an IDC endpoint from somewhere (resource management is again
> outside the scope of this sketch).
>
> Call idc_endpoint_create.
>
> Wait for 'connect' before attempting to use connection for device
> specific protocol implemented using messages/transactions/remote buffer
> references.
>
> Call idc_endpoint_destroy and quiesce before unloading module.

Quiesce across remote nodes as well?

> The implementation of the local buffer references and memory management
> can hide the use of pages which are shared between domains and reference
> counted to provide a zero copy implementation of bulk data transfer and
> shared page-caches.
>
> I implemented something very similar to this before for a cluster
> interconnect and it worked very nicely. There are some subtleties to
> get right about the remote buffer reference implementation and the
> implications for out-of-order and idempotent bulk data transfers.

All the above looked very sane. How does stuff get out of
order, though? We have effectively per-device queues.

> As I said, it would require a few more days' work to nail down a good
> API.

thanks,
Nivedita