[Xen-devel] Re: Interdomain comms

On Fri, 2005-05-06 at 08:46 +0100, Mike Wray wrote:
> Harry Butterworth wrote:
> 
> > The current overhead in terms of client code to establish an entity on
> > the xen inter-domain communication "bus" is currently of the order of
> > 1000 statements (counting FE, BE and slice of xend).  A better
> > inter-domain communication API could reduce this to fewer than 10
> > statements.  If it's not done by the time I finish the USB work, I will
> > hopefully be allowed to help with this.
> > 
> 
> This reminded me you had suggested a different model for inter-domain comms.
> I recently suggested a more socket-like API but it didn't go down well.

What exactly were the issues with the socket-like proposal?

> 
> I agree with you that the event channel model could be improved -
> what kind of comms model do you suggest?

The event-channel and shared memory page are fine as low-level
primitives to implement a comms channel between domains on the same
physical machine. The problem is that the primitives are unnecessarily
low-level from the client's perspective and result in too much
per-client code.

The inter-domain communication API should preserve the efficiency of
these primitives but provide a higher level API which is more convenient
to use.

Another issue with the current API is that, in the future, it is likely
(for a number of virtual-iron/fault-tolerant-virtual-machine-like
reasons) that it will be useful for the inter-domain communication API
to span physical nodes in a cluster. The current API couples the clients
directly to a shared memory implementation with a direct connection
between the front and back end domains, so the clients would all need to
be rewritten if the implementation were to span physical machines or
require indirection. Eventually I would expect the effort invested in
the clients of the inter-domain API to equal or exceed the effort
invested in the hypervisor, in the same way that the Linux device
drivers make up the bulk of the Linux kernel code. There is a risk
therefore that this might become a significant architectural
limitation.

So, I think we're looking for a higher-level API which can preserve the
current efficient implementation for domains resident on the same
physical machine but allows for domains to be separated by a network
interface without having to rewrite all the drivers.

The API needs to address the following issues:

Resource discovery --- Discovering the targets of IDC is an inherent
requirement.

Dynamic behaviour --- Domains are going to come and go all the time.

Stale communications --- When domains come and go, client protocols must
have a way to recover from communications in flight or potentially in
flight from before the last transition.

Deadlock --- IDC is a shared resource and must not introduce resource
deadlock issues, for example when FE and BEs are arranged symmetrically
in reverse across the same interface or when BEs are stacked and so
introduce chains of dependencies.

Security --- There are varying degrees of trust between the domains.

Ease of use --- This is important for developer productivity and also to
help ensure the other goals (security/robustness) are actually met.

Efficiency/Performance --- obviously.

I'd need a few days (which I don't have right now) to put together a
coherent proposal tailored specifically to xen.  However, it would
probably be along the lines of the following:

A buffer abstraction to decouple the IDC API from the memory management
implementation:

struct local_buffer_reference;

An endpoint abstraction to represent one end of an IDC connection.  It's
important that this is done on a per connection basis rather than having
one per domain for all IDC activity because it avoids deadlock issues
arising from chained, dependent communication.

struct idc_endpoint;

A message abstraction because some protocols are more efficiently
implemented using one-way messages than request-response pairs,
particularly when the protocol involves more than two parties.

struct idc_message
{
    ...
    struct local_buffer_reference message_body;
};

/* When a received message is finished with */

void idc_message_complete( struct idc_message * message );

A request-response transaction abstraction because most protocols are
more easily implemented with these.

struct idc_transaction
{
    ...
    struct local_buffer_reference transaction_parameters;
    struct local_buffer_reference transaction_status;
};

/* Useful to have an error code in addition to status.  */

/* When a received transaction is finished with. */

void idc_transaction_complete
  ( struct idc_transaction * transaction, error_code error );

/* When an initiated transaction completes. Error code also reports
transport errors when endpoint disconnects whilst transaction is
outstanding. */

error_code idc_transaction_query_error_code
  ( struct idc_transaction * transaction );
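
To make the error-code behaviour concrete, here is a minimal sketch of
how a transaction might carry either the responder's error code or a
transport error. Everything in it is an assumption for illustration:
the sketch_ names, the integer error_code, the IDC_ERR_TRANSPORT value
and the idea that the endpoint walks an array of outstanding
transactions on disconnect are not part of the proposal above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: the proposal does not define error_code or its values. */
typedef int error_code;
enum { IDC_OK = 0, IDC_ERR_TRANSPORT = -1 };

/* Hypothetical initiator-side view of one transaction. */
struct sketch_transaction {
    int outstanding;   /* non-zero from submit until completion */
    error_code error;  /* valid once the transaction has completed */
};

/* Mark a transaction as submitted and awaiting completion. */
void sketch_transaction_submit(struct sketch_transaction *t)
{
    t->outstanding = 1;
    t->error = IDC_OK;
}

/* Complete an outstanding transaction with the given error code.
 * A transaction that has already completed keeps its first result. */
void sketch_transaction_complete(struct sketch_transaction *t,
                                 error_code error)
{
    if (t->outstanding) {
        t->error = error;
        t->outstanding = 0;
    }
}

/* On disconnect, every transaction still outstanding completes with a
 * transport error so the initiator never waits forever. */
void sketch_fail_outstanding(struct sketch_transaction *list, size_t count)
{
    for (size_t i = 0; i < count; i++)
        sketch_transaction_complete(&list[i], IDC_ERR_TRANSPORT);
}

/* Only meaningful after completion, hence the assertion. */
error_code sketch_transaction_query_error_code
  (const struct sketch_transaction *t)
{
    assert(!t->outstanding);
    return t->error;
}
```

The point of the sketch is only that the initiator sees exactly one
completion per transaction, whichever of the responder or the transport
gets there first.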

An IDC address abstraction:

struct idc_address;

A mechanism to initiate connection establishment.  It can't fail because
the endpoint resource is pre-allocated and create doesn't actually need
to establish the connection.

The endpoint calls the registered notification functions as follows:

'appear' when the remote endpoint is discovered then 'disappear' if it
goes away again or 'connect' if a connection is actually established.

After 'connect', the client can submit messages and transactions.

'disconnect' when the connection is failing.  The client must wait for
outstanding messages and transactions to complete (successfully or with
a transport error) before completing the disconnect callback and must
flush received messages and transactions whilst disconnected.

Then 'connect' if the connection is reestablished or 'disappear' if the
remote endpoint has gone away.

A disconnect, connect cycle guarantees that the remote endpoint also
goes through a disconnect, connect cycle.

This API allows multi-pathing clients to make intelligent decisions and
provides sufficient guarantees about stale messages and transactions to
make a useful foundation.

void idc_endpoint_create
(
    struct idc_endpoint * endpoint,
    struct idc_address address,
    void ( * appear     )( struct idc_endpoint * endpoint ),
    void ( * connect    )( struct idc_endpoint * endpoint ),
    void ( * disconnect )
      ( struct idc_endpoint * endpoint, struct callback * callback ),
    void ( * disappear )( struct idc_endpoint * endpoint ),
    void ( * handle_message )
      ( struct idc_endpoint * endpoint, struct idc_message * message ),
    void ( * handle_transaction )
    (
        struct idc_endpoint * endpoint,
        struct idc_transaction * transaction
    )
);
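
The ordering of the notifications described above can be read as a small
state machine. The following is a sketch under my reading of that
ordering, not a definitive implementation: the enum names and both
functions are hypothetical, and I am assuming that 'disappear' is never
delivered while connected without a 'disconnect' first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names: the text only names the four notifications. */
enum idc_state { IDC_ABSENT, IDC_VISIBLE, IDC_CONNECTED, IDC_DISCONNECTED };
enum idc_event { IDC_APPEAR, IDC_CONNECT, IDC_DISCONNECT, IDC_DISAPPEAR };

/* Apply one notification.  Returns true and updates *state when the
 * notification is legal in the current state; otherwise returns false
 * and leaves *state unchanged. */
bool idc_state_step(enum idc_state *state, enum idc_event event)
{
    assert(state != NULL);
    switch (*state) {
    case IDC_ABSENT:        /* remote endpoint not yet discovered */
        if (event == IDC_APPEAR) { *state = IDC_VISIBLE; return true; }
        break;
    case IDC_VISIBLE:       /* discovered but never connected */
        if (event == IDC_CONNECT) { *state = IDC_CONNECTED; return true; }
        if (event == IDC_DISAPPEAR) { *state = IDC_ABSENT; return true; }
        break;
    case IDC_CONNECTED:     /* only 'disconnect' may follow 'connect' */
        if (event == IDC_DISCONNECT) {
            *state = IDC_DISCONNECTED;
            return true;
        }
        break;
    case IDC_DISCONNECTED:  /* reconnect, or the remote end goes away */
        if (event == IDC_CONNECT) { *state = IDC_CONNECTED; return true; }
        if (event == IDC_DISAPPEAR) { *state = IDC_ABSENT; return true; }
        break;
    }
    return false;
}

/* Messages and transactions may only be submitted after 'connect'. */
bool idc_may_submit(enum idc_state state)
{
    return state == IDC_CONNECTED;
}
```

A client could keep one such state per endpoint and refuse to submit
unless idc_may_submit() holds.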

void idc_endpoint_submit_message
  ( struct idc_endpoint * endpoint, struct idc_message * message );

void idc_endpoint_submit_transaction
(
    struct idc_endpoint * endpoint,
    struct idc_transaction * transaction
);

idc_endpoint_destroy completes the callback once the endpoint has
'disconnected' and 'disappeared' and the endpoint resource is free for
reuse for a different connection.

void idc_endpoint_destroy
(
    struct idc_endpoint * endpoint,
    struct callback * callback
);

The messages and transaction parameters and status must be of finite
length (these quota properties might be parameters of the endpoint
resource allocation). Need a mechanism for efficient, arbitrary length
bulk transfer too.

An abstraction for buffers owned by remote domains:

struct remote_buffer_reference;

Can register a local buffer with the IDC to get a remote buffer
reference:

struct remote_buffer_reference idc_register_buffer
(
    struct local_buffer_reference buffer
    /* some kind of resource probably required here */
);

remote buffer references may be passed between domains in idc messages
or transaction parameters or transaction status.

remote buffer references may be forwarded between domains and are usable
from any domain.

Once in possession of a remote buffer reference, a domain can transfer
data between the remote buffer and a local buffer:

void idc_send_to_remote_buffer
(
    struct remote_buffer_reference remote_buffer,
    struct local_buffer_reference local_buffer,
    struct callback * callback /* transfer completes asynchronously */
    /* some kind of resource required here */
);

void idc_receive_from_remote_buffer
(
    struct remote_buffer_reference remote_buffer,
    struct local_buffer_reference local_buffer,
    struct callback * callback /* Again, completes asynchronously */
    /* some kind of resource required here */
);

Can unregister to free a local buffer independent of remote buffer
references still knocking around in remote domains (subsequent
sends/receives fail):

void idc_unregister_buffer
  ( /* probably a pointer to the resource passed on registration */ );
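
One way the "subsequent sends/receives fail" property could be obtained
is with a generation count per registration. This is only a sketch under
stated assumptions: it runs in a single address space, the transfer is a
synchronous memcpy rather than an asynchronous callback, unregistration
takes the reference itself rather than a registration resource, and
every name in it (the sketch_ functions, struct local_buffer, struct
remote_ref, REG_SLOTS) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REG_SLOTS 8   /* hypothetical fixed quota of registrations */

struct local_buffer { void *base; size_t length; };
struct remote_ref   { unsigned slot; unsigned generation; };

struct registration {
    struct local_buffer buffer;
    unsigned generation;  /* bumped on unregister to kill stale refs */
    int in_use;
};

static struct registration table[REG_SLOTS];

/* Register a buffer.  Returns 0 and fills *ref, or -1 if no slot is free. */
int sketch_register_buffer(struct local_buffer buffer, struct remote_ref *ref)
{
    assert(ref != NULL);
    for (unsigned i = 0; i < REG_SLOTS; i++) {
        if (!table[i].in_use) {
            table[i].buffer = buffer;
            table[i].in_use = 1;
            ref->slot = i;
            ref->generation = table[i].generation;
            return 0;
        }
    }
    return -1;
}

/* Unregister.  References still held elsewhere become stale because the
 * slot's generation no longer matches theirs. */
void sketch_unregister_buffer(struct remote_ref ref)
{
    if (ref.slot < REG_SLOTS && table[ref.slot].in_use &&
        table[ref.slot].generation == ref.generation) {
        table[ref.slot].in_use = 0;
        table[ref.slot].generation++;
    }
}

/* Copy length bytes into the registered buffer.  Returns -1 if the
 * reference is stale or the data does not fit, 0 on success. */
int sketch_send_to_remote_buffer(struct remote_ref ref,
                                 const void *src, size_t length)
{
    if (ref.slot >= REG_SLOTS)
        return -1;
    struct registration *r = &table[ref.slot];
    if (!r->in_use || r->generation != ref.generation ||
        length > r->buffer.length)
        return -1;
    memcpy(r->buffer.base, src, length);
    return 0;
}
```

With this scheme a slot can be reused immediately after unregistration
without an old reference ever reaching the new buffer.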

So, the 1000 statements of establishment code in the current drivers
become:

Receive an idc address from somewhere (resource discovery is outside the
scope of this sketch).

Allocate an IDC endpoint from somewhere (resource management is again
outside the scope of this sketch).

Call idc_endpoint_create.

Wait for 'connect' before attempting to use connection for device
specific protocol implemented using messages/transactions/remote buffer
references.

Call idc_endpoint_destroy and quiesce before unloading module.
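
The steps above might look roughly like this from a driver's point of
view. It is a sketch against a mock, not the real thing: the mock_
functions stand in for idc_endpoint_create and idc_endpoint_destroy
(with the message and transaction handlers left out for brevity), the
contents of struct idc_address, struct callback and struct idc_endpoint
are invented, and the mock completes every callback synchronously.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical definitions: the proposal leaves these types opaque. */
struct idc_address { int id; };
struct callback { void (*fn)(struct callback *self); void *context; };
struct idc_endpoint {
    struct idc_address address;
    bool up;        /* mock transport state */
    void *client;   /* client-private data */
    void (*appear)(struct idc_endpoint *);
    void (*connect)(struct idc_endpoint *);
    void (*disconnect)(struct idc_endpoint *, struct callback *);
    void (*disappear)(struct idc_endpoint *);
};

/* Mock create: only records the callbacks, so it cannot fail. */
void mock_endpoint_create(struct idc_endpoint *ep, struct idc_address address,
    void (*appear)(struct idc_endpoint *),
    void (*connect)(struct idc_endpoint *),
    void (*disconnect)(struct idc_endpoint *, struct callback *),
    void (*disappear)(struct idc_endpoint *))
{
    ep->address = address;
    ep->up = false;
    ep->appear = appear;
    ep->connect = connect;
    ep->disconnect = disconnect;
    ep->disappear = disappear;
}

/* Mock transport event: the remote end is discovered and connects. */
void mock_transport_bring_up(struct idc_endpoint *ep)
{
    ep->appear(ep);
    ep->connect(ep);
    ep->up = true;
}

static void mock_ack(struct callback *self) { (void)self; }

/* Mock destroy: disconnect, disappear, then complete the callback. */
void mock_endpoint_destroy(struct idc_endpoint *ep, struct callback *done)
{
    if (ep->up) {
        struct callback ack = { mock_ack, NULL };
        ep->disconnect(ep, &ack);
        ep->disappear(ep);
        ep->up = false;
    }
    done->fn(done);
}

/* The client: a hypothetical driver with one endpoint. */
struct driver {
    struct idc_endpoint endpoint;
    struct callback destroyed;
    bool connected;
    bool quiesced;
};

static void on_appear(struct idc_endpoint *ep) { (void)ep; }
static void on_disappear(struct idc_endpoint *ep) { (void)ep; }
static void on_connect(struct idc_endpoint *ep)
{
    ((struct driver *)ep->client)->connected = true;
}
static void on_disconnect(struct idc_endpoint *ep, struct callback *cb)
{
    ((struct driver *)ep->client)->connected = false;
    cb->fn(cb);   /* nothing is outstanding in this sketch */
}
static void on_destroyed(struct callback *self)
{
    ((struct driver *)self->context)->quiesced = true;
}

void driver_start(struct driver *d, struct idc_address address)
{
    d->connected = false;
    d->quiesced = false;
    d->endpoint.client = d;
    mock_endpoint_create(&d->endpoint, address,
                         on_appear, on_connect, on_disconnect, on_disappear);
}

void driver_stop(struct driver *d)
{
    d->destroyed.fn = on_destroyed;
    d->destroyed.context = d;
    mock_endpoint_destroy(&d->endpoint, &d->destroyed);
    assert(d->quiesced);   /* holds only because the mock is synchronous */
}
```

The client-side part (everything from struct driver down) is the piece
that corresponds to the handful of statements claimed above; the rest is
scaffolding that the IDC implementation would own.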

The implementation of the local buffer references and memory management
can hide the use of pages which are shared between domains and reference
counted to provide a zero copy implementation of bulk data transfer and
shared page-caches.

I implemented something very similar to this before for a cluster
interconnect and it worked very nicely.  There are some subtleties to
get right about the remote buffer reference implementation and the
implications for out-of-order and idempotent bulk data transfers.

As I said, it would require a few more days' work to nail down a good
API.

Harry.


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel