xen-devel

Re: [Xen-devel] Re: Interdomain comms

I like it.  To start with, local communication only would be fine.  Eventually 
it would scale neatly to things like remote device access.

I particularly like the abstraction for remote memory - this would be an 
excellent fit to take advantage of RDMA where available (e.g. a cluster 
running on an IB fabric).

Cheers,
Mark

On Friday 06 May 2005 13:14, Harry Butterworth wrote:
> On Fri, 2005-05-06 at 08:46 +0100, Mike Wray wrote:
> > Harry Butterworth wrote:
> > > The current overhead in terms of client code to establish an entity on
> > > the xen inter-domain communication "bus" is of the order of
> > > 1000 statements (counting the FE, the BE and a slice of xend).  A better
> > > inter-domain communication API could reduce this to fewer than 10
> > > statements.  If it's not done by the time I finish the USB work, I will
> > > hopefully be allowed to help with this.
> >
> > This reminded me you had suggested a different model for inter-domain
> > comms. I recently suggested a more socket-like API but it didn't go down
> > well.
>
> What exactly were the issues with the socket-like proposal?
>
> > I agree with you that the event channel model could be improved -
> > what kind of comms model do you suggest?
>
> The event-channel and shared memory page are fine as low-level
> primitives to implement a comms channel between domains on the same
> physical machine. The problem is that the primitives are unnecessarily
> low-level from the client's perspective and result in too much
> per-client code.
>
> The inter-domain communication API should preserve the efficiency of
> these primitives but present a higher-level interface which is more
> convenient to use.
>
> Another issue is that, in the future, it is likely (for a number of
> virtual-iron/fault-tolerant-virtual-machine-like reasons) that it will
> be useful for the inter-domain communication API to span physical nodes
> in a cluster. The problem with the current API is that it directly
> couples the clients to a shared-memory implementation with a direct
> connection between the front- and back-end domains, so the clients
> would all need to be rewritten if the implementation were to span
> physical machines or require indirection. Eventually I would expect the
> effort invested in the clients of the inter-domain API to equal or
> exceed the effort invested in the hypervisor, in the same way that the
> linux device drivers make up the bulk of the linux kernel code. There
> is a risk, therefore, that this might become a significant
> architectural limitation.
>
> So, I think we're looking for a higher-level API which can preserve the
> current efficient implementation for domains resident on the same
> physical machine but allow for domains to be separated by a network
> interface without having to rewrite all the drivers.
>
> The API needs to address the following issues:
>
> Resource discovery --- Discovering the targets of IDC is an inherent
> requirement.
>
> Dynamic behaviour --- Domains are going to come and go all the time.
>
> Stale communications --- When domains come and go, client protocols must
> have a way to recover from communications in flight or potentially in
> flight from before the last transition.
>
> Deadlock --- IDC is a shared resource and must not introduce resource
> deadlock issues, for example when FEs and BEs are arranged symmetrically
> in reverse across the same interface or when BEs are stacked and so
> introduce chains of dependencies.
>
> Security --- There are varying degrees of trust between the domains.
>
> Ease of use --- This is important for developer productivity and also to
> help ensure the other goals (security/robustness) are actually met.
>
> Efficiency/Performance --- obviously.
>
> I'd need a few days (which I don't have right now) to put together a
> coherent proposal tailored specifically to xen.  However, it would
> probably be along the lines of the following:
>
> A buffer abstraction to decouple the IDC API from the memory management
> implementation:
>
> struct local_buffer_reference;
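>
> As a rough illustration only (this layout is just one possibility, not
> part of the proposal), a local buffer reference might be little more
> than a scatter-gather descriptor over locally owned memory:
>
> struct local_buffer_segment
> {
>     void *        address; /* kernel-virtual address of this segment */
>     unsigned long length;  /* length of this segment in bytes */
> };
>
> struct local_buffer_reference
> {
>     unsigned int                  segment_count;
>     struct local_buffer_segment * segments; /* scatter-gather list */
> };
>
> Clients would treat the contents as opaque; only the IDC and memory
> management code would interpret the fields.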
>
> An endpoint abstraction to represent one end of an IDC connection.  It's
> important that this is done on a per-connection basis rather than having
> one endpoint per domain for all IDC activity, because it avoids deadlock
> issues arising from chained, dependent communication.
>
> struct idc_endpoint;
>
> A message abstraction because some protocols are more efficiently
> implemented using one-way messages than request-response pairs,
> particularly when the protocol involves more than two parties.
>
> struct idc_message
> {
>     ...
>     struct local_buffer_reference message_body;
> };
>
> /* When a received message is finished with */
>
> void idc_message_complete( struct idc_message * message );
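>
> Purely as a sketch of the intended usage (be_process_payload is a
> made-up, device-specific function), a back end's message handler could
> then be as small as:
>
> static void be_handle_message
>   ( struct idc_endpoint * endpoint, struct idc_message * message )
> {
>     /* Interpret the device-specific payload. */
>     be_process_payload( message->message_body );
>
>     /* Tell the IDC layer the message buffer can be recycled. */
>     idc_message_complete( message );
> }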
>
> A request-response transaction abstraction because most protocols are
> more easily implemented with these.
>
> struct idc_transaction
> {
>     ...
>     struct local_buffer_reference transaction_parameters;
>     struct local_buffer_reference transaction_status;
> };
>
> /* Useful to have an error code in addition to status.  */
>
> /* When a received transaction is finished with. */
>
> void idc_transaction_complete
>   ( struct idc_transaction * transaction, error_code error );
>
> /* When an initiated transaction completes. Error code also reports
> transport errors when endpoint disconnects whilst transaction is
> outstanding. */
>
> error_code idc_transaction_query_error_code
>   ( struct idc_transaction * transaction );
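>
> Again just to sketch the flow (be_do_request is invented for the
> example), the receiving side's transaction handler might look like:
>
> static void be_handle_transaction
>   ( struct idc_endpoint * endpoint, struct idc_transaction * transaction )
> {
>     /* Interpret the parameters, do the work and fill in the status
>        buffer, returning a device-specific error code. */
>     error_code error = be_do_request( transaction->transaction_parameters,
>                                       transaction->transaction_status );
>
>     /* Hand the status (and error code) back to the initiator. */
>     idc_transaction_complete( transaction, error );
> }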
>
> An IDC address abstraction:
>
> struct idc_address;
>
> A mechanism to initiate connection establishment; it can't fail because
> the endpoint resource is pre-allocated and create doesn't actually need
> to establish the connection.
>
> The endpoint calls the registered notification functions as follows:
>
> 'appear' when the remote endpoint is discovered, then 'disappear' if it
> goes away again or 'connect' if a connection is actually established.
>
> After 'connect', the client can submit messages and transactions.
>
> 'disconnect' when the connection is failing; the client must wait for
> outstanding messages and transactions to complete (successfully or with
> a transport error) before completing the disconnect callback, and must
> flush received messages and transactions whilst disconnected.
>
> Then 'connect' if the connection is reestablished or 'disappear' if the
> remote endpoint has gone away.
>
> A disconnect, connect cycle guarantees that the remote endpoint also
> goes through a disconnect, connect cycle.
>
> This API allows multi-pathing clients to make intelligent decisions and
> provides sufficient guarantees about stale messages and transactions to
> make a useful foundation.
>
> void idc_endpoint_create
> (
>     struct idc_endpoint * endpoint,
>     struct idc_address address,
>     void ( * appear     )( struct idc_endpoint * endpoint ),
>     void ( * connect    )( struct idc_endpoint * endpoint ),
>     void ( * disconnect )
>       ( struct idc_endpoint * endpoint, struct callback * callback ),
>     void ( * disappear )( struct idc_endpoint * endpoint ),
>     void ( * handle_message )
>       ( struct idc_endpoint * endpoint, struct idc_message * message ),
>     void ( * handle_transaction )
>     (
>         struct idc_endpoint * endpoint,
>         struct idc_transaction * transaction
>     )
> );
>
> void idc_endpoint_submit_message
>   ( struct idc_endpoint * endpoint, struct idc_message * message );
>
> void idc_endpoint_submit_transaction
>   ( struct idc_endpoint * endpoint,
>     struct idc_transaction * transaction );
>
> idc_endpoint_destroy completes the callback once the endpoint has
> 'disconnected' and 'disappeared' and the endpoint resource is free for
> reuse for a different connection.
>
> void idc_endpoint_destroy
> (
>     struct idc_endpoint * endpoint,
>     struct callback * callback
> );
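>
> As a very rough usage sketch (the fe_* names and the connected flag are
> placeholders, and fe_handle_message/fe_handle_transaction would be along
> the lines of the handlers sketched above), a front end might wire itself
> up like this:
>
> static int fe_connected;
>
> static void fe_appear( struct idc_endpoint * endpoint )
> {
>     /* Remote endpoint discovered; nothing to do yet. */
> }
>
> static void fe_connect( struct idc_endpoint * endpoint )
> {
>     /* Safe to submit messages and transactions from now on. */
>     fe_connected = 1;
> }
>
> static void fe_disconnect
>   ( struct idc_endpoint * endpoint, struct callback * callback )
> {
>     fe_connected = 0;
>     /* Wait for outstanding messages and transactions to finish
>        (successfully or with a transport error), flush anything
>        received, then complete 'callback'. */
> }
>
> static void fe_disappear( struct idc_endpoint * endpoint )
> {
>     /* Remote endpoint has gone away. */
> }
>
> static void fe_start
>   ( struct idc_endpoint * endpoint, struct idc_address address )
> {
>     idc_endpoint_create( endpoint, address,
>                          fe_appear, fe_connect, fe_disconnect,
>                          fe_disappear, fe_handle_message,
>                          fe_handle_transaction );
> }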
>
> The messages and transaction parameters and status must be of finite
> length (these quota properties might be parameters of the endpoint
> resource allocation). Need a mechanism for efficient, arbitrary-length
> bulk transfer too.
>
> An abstraction for buffers owned by remote domains:
>
> struct remote_buffer_reference;
>
> Can register a local buffer with the IDC to get a remote buffer
> reference:
>
> struct remote_buffer_reference idc_register_buffer
>   ( struct local_buffer_reference buffer,
>     some kind of resource probably required here );
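>
> For instance, a front end issuing a write might register the request's
> data buffer and pass the resulting reference across in the transaction
> parameters (struct fe_request, fe_marshal_write and fe_endpoint are all
> invented here, and request->resource stands in for the unspecified
> resource argument above):
>
> static void fe_start_write
>   ( struct fe_request * request, struct idc_transaction * transaction )
> {
>     /* Make the request's data buffer addressable by the back end. */
>     struct remote_buffer_reference data_ref =
>         idc_register_buffer( request->data, request->resource );
>
>     /* Marshal data_ref and the device-specific command into
>        transaction->transaction_parameters, then submit. */
>     fe_marshal_write( transaction, data_ref, request );
>     idc_endpoint_submit_transaction( fe_endpoint, transaction );
> }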
>
> Remote buffer references may be passed between domains in idc messages
> or transaction parameters or transaction status.
>
> Remote buffer references may be forwarded between domains and are usable
> from any domain.
>
> Once in possession of a remote buffer reference, a domain can transfer
> data between the remote buffer and a local buffer:
>
> void idc_send_to_remote_buffer
> (
>     struct remote_buffer_reference remote_buffer,
>     struct local_buffer_reference local_buffer,
>     struct callback * callback, /* transfer completes asynchronously */
>     some kind of resource required here
> );
>
> void idc_receive_from_remote_buffer
> (
>     struct remote_buffer_reference remote_buffer,
>     struct local_buffer_reference local_buffer,
>     struct callback * callback, /* Again, completes asynchronously */
>     some kind of resource required here
> );
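>
> The matching back-end side of that write would then pull the data
> across when it is ready (again, struct idc_resource and the staging
> buffer management are placeholders for this sketch):
>
> static void be_start_write
> (
>     struct remote_buffer_reference data_ref,  /* from the FE's parameters */
>     struct local_buffer_reference  staging,   /* back end's staging buffer */
>     struct callback *              callback,  /* fires when data has arrived */
>     struct idc_resource *          resource   /* the unspecified resource */
> )
> {
>     /* Asynchronously copy (or RDMA) the front end's data into the
>        back end's staging buffer. */
>     idc_receive_from_remote_buffer( data_ref, staging, callback, resource );
> }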
>
> Can unregister to free a local buffer independently of remote buffer
> references still knocking around in remote domains (subsequent
> sends/receives fail):
>
> void idc_unregister_buffer
>   ( probably a pointer to the resource passed on registration );
>
> So, the 1000 statements of establishment code in the current drivers
> become the following (a rough code sketch follows the list):
>
> Receive an idc address from somewhere (resource discovery is outside the
> scope of this sketch).
>
> Allocate an IDC endpoint from somewhere (resource management is again
> outside the scope of this sketch).
>
> Call idc_endpoint_create.
>
> Wait for 'connect' before attempting to use connection for device
> specific protocol implemented using messages/transactions/remote buffer
> references.
>
> Call idc_endpoint_destroy and quiesce before unloading module.
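>
> In other words, something along these lines (the dev_* names are all
> hypothetical and the waiting/quiescing details are hand-waved):
>
> static int dev_init( struct idc_address address )
> {
>     dev.endpoint = dev_allocate_endpoint(); /* resource mgmt elsewhere */
>
>     idc_endpoint_create( dev.endpoint, address,
>                          dev_appear, dev_connect, dev_disconnect,
>                          dev_disappear, dev_handle_message,
>                          dev_handle_transaction );
>
>     /* Device-specific protocol starts from the 'connect' callback. */
>     return 0;
> }
>
> static void dev_exit( void )
> {
>     /* The callback completes once the endpoint has 'disconnected' and
>        'disappeared'; wait for it and quiesce before unloading. */
>     idc_endpoint_destroy( dev.endpoint, &dev.destroy_callback );
>     dev_wait_for_destroy();
> }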
>
> The implementation of the local buffer references and memory management
> can hide the use of pages which are shared between domains and
> reference-counted to provide a zero-copy implementation of bulk data
> transfer and shared page-caches.
>
> I implemented something very similar to this before for a cluster
> interconnect and it worked very nicely.  There are some subtleties to
> get right about the remote buffer reference implementation and the
> implications for out-of-order and idempotent bulk data transfers.
>
> As I said, it would require a few more days work to nail down a good
> API.
>
> Harry.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel