xen-devel
Re: [Xen-devel] Re: Interdomain comms
Harry Butterworth wrote:
On Fri, 2005-05-06 at 08:46 +0100, Mike Wray wrote:
Harry, thanks for bringing this to the xen-devel discussion.
The inter-domain communication API should preserve the efficiency of
these primitives but provide a higher level API which is more convenient
to use.
We certainly need to simplify the API for the frontends -
making it easier to add frontends for new devices and OSs.
We also need to build in support for frontend - frontend
communication in an efficient way.
So, I think we're looking for a higher-level API which can preserve the
current efficient implementation for domains resident on the same
physical machine but allows for domains to be separated by a network
interface without having to rewrite all the drivers.
The API needs to address the following issues:
Resource discovery --- Discovering the targets of IDC is an inherent
requirement.
Dynamic behaviour --- Domains are going to come and go all the time.
Stale communications --- When domains come and go, client protocols must
have a way to recover from communications in flight or potentially in
flight from before the last transition.
Deadlock --- IDC is a shared resource and must not introduce resource
deadlock issues, for example when FEs and BEs are arranged symmetrically
in reverse across the same interface or when BEs are stacked and so
introduce chains of dependencies.
Security --- There are varying degrees of trust between the domains.
Ease of use --- This is important for developer productivity and also to
help ensure the other goals (security/robustness) are actually met.
Efficiency/Performance --- obviously.
I'd need a few days (which I don't have right now) to put together a
coherent proposal tailored specifically to xen. However, it would
probably be along the lines of the following:
A buffer abstraction to decouple the IDC API from the memory management
implementation:
struct local_buffer_reference;
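Purely as an illustration of what could sit behind that opaque declaration (the field names below are my assumption, not part of the proposal), a scatter-gather list of page fragments would fit the reference-counted shared-page implementation suggested at the end of this note:

struct local_buffer_fragment
{
    unsigned long frame;   /* page frame backing this fragment */
    unsigned int  offset;  /* byte offset within the page */
    unsigned int  length;  /* bytes used in this fragment */
};

struct local_buffer_reference
{
    unsigned int                   fragment_count;
    struct local_buffer_fragment * fragments;
};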
An endpoint abstraction to represent one end of an IDC connection. It's
important that this is done on a per connection basis rather than having
one per domain for all IDC activity because it avoids deadlock issues
arising from chained, dependent communication.
struct idc_endpoint;
A message abstraction because some protocols are more efficiently
implemented using one-way messages than request-response pairs,
particularly when the protocol involves more than two parties.
struct idc_message
{
...
struct local_buffer_reference message_body;
};
/* When a received message is finished with */
void idc_message_complete( struct idc_message * message );
A request-response transaction abstraction because most protocols are
more easily implemented with these.
struct idc_transaction
{
...
struct local_buffer_reference transaction_parameters;
struct local_buffer_reference transaction_status;
};
/* Useful to have an error code in addition to status. */
/* When a received transaction is finished with. */
void idc_transaction_complete
( struct idc_transaction * transaction, error_code error );
/* When an initiated transaction completes. Error code also reports
transport errors when endpoint disconnects whilst transaction is
outstanding. */
error_code idc_transaction_query_error_code
( struct idc_transaction * transaction );
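As a rough sketch of how the responder side might use this (the handler and helper names below are mine, not part of the proposal), a device-specific handle_transaction callback would decode the parameters, fill in the status buffer and then complete the transaction; on the initiating side, idc_transaction_query_error_code then distinguishes a device-level result from a transport failure caused by a disconnect:

/* Hypothetical responder-side handler; my_device_process_request is an
   assumed device-specific helper. */
static void my_device_handle_transaction
    ( struct idc_endpoint * endpoint, struct idc_transaction * transaction )
{
    /* Decode transaction->transaction_parameters, do the work and write
       the reply into transaction->transaction_status. */
    error_code error = my_device_process_request( transaction );

    /* Hand the transaction back to the IDC layer; the error code also
       travels back to the initiator. */
    idc_transaction_complete( transaction, error );
}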
An IDC address abstraction:
struct idc_address;
A mechanism to initiate connection establishment. It can't fail, because
the endpoint resource is pre-allocated and create doesn't actually need to
establish the connection.
The endpoint calls the registered notification functions as follows:
'appear' when the remote endpoint is discovered then 'disappear' if it
goes away again or 'connect' if a connection is actually established.
After 'connect', the client can submit messages and transactions.
'disconnect' when the connection is failing: the client must wait for
outstanding messages and transactions to complete (successfully or with a
transport error) before completing the disconnect callback, and must
flush received messages and transactions whilst disconnected.
Then 'connect' if the connection is reestablished or 'disappear' if the
remote endpoint has gone away.
A disconnect, connect cycle guarantees that the remote endpoint also
goes through a disconnect, connect cycle.
This API allows multi-pathing clients to make intelligent decisions and
provides sufficient guarantees about stale messages and transactions to
make a useful foundation.
void idc_endpoint_create
(
struct idc_endpoint * endpoint,
struct idc_address address,
void ( * appear )( struct idc_endpoint * endpoint ),
void ( * connect )( struct idc_endpoint * endpoint ),
void ( * disconnect )
( struct idc_endpoint * endpoint, struct callback * callback ),
void ( * disappear )( struct idc_endpoint * endpoint ),
void ( * handle_message )
( struct idc_endpoint * endpoint, struct idc_message * message ),
void ( * handle_transaction )
(
struct idc_endpoint * endpoint,
struct idc_transaction * transaction
)
);
void idc_endpoint_submit_message
( struct idc_endpoint * endpoint, struct idc_message * message );
void idc_endpoint_submit_transaction
( struct idc_endpoint * endpoint, struct idc_transaction * transaction );
idc_endpoint_destroy completes the callback once the endpoint has
'disconnected' and 'disappeared'; the endpoint resource is then free for
reuse for a different connection.
void idc_endpoint_destroy
(
struct idc_endpoint * endpoint,
struct callback * callback
);
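To make the callback contract above concrete, here is a rough sketch (entirely my own; the frontend structure and helpers are assumptions) of a 'disconnect' implementation that stops issuing new work, waits for everything outstanding to finish with success or a transport error, and only then completes the callback handed over by the IDC layer:

/* Hypothetical frontend state; field and helper names are assumptions. */
struct my_frontend
{
    struct idc_endpoint * endpoint;
    unsigned int          outstanding;        /* in-flight messages + transactions */
    struct callback     * pending_disconnect;
    struct callback       destroy_callback;   /* completed by idc_endpoint_destroy */
};

static void my_frontend_disconnect
    ( struct idc_endpoint * endpoint, struct callback * callback )
{
    struct my_frontend * fe = my_frontend_from_endpoint( endpoint );

    /* Stop submitting new messages and transactions. */
    my_frontend_stop_queue( fe );

    /* Complete the callback only once the in-flight count reaches zero;
       otherwise the last completion handler does it.  Anything received
       whilst disconnected is flushed rather than processed. */
    fe->pending_disconnect = callback;
    if ( fe->outstanding == 0 )
        my_callback_complete( callback );
}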
The message bodies and the transaction parameters and status must be of
finite length (these quota properties might be parameters of the endpoint
resource allocation). A mechanism for efficient, arbitrary-length bulk
transfer is needed too.
An abstraction for buffers owned by remote domains:
struct remote_buffer_reference;
Can register a local buffer with the IDC to get a remote buffer
reference:
struct remote_buffer_reference idc_register_buffer
( struct local_buffer_reference buffer, some kind of resource probably
required here );
Remote buffer references may be passed between domains in IDC messages,
transaction parameters or transaction status. They may be forwarded
between domains and are usable from any domain.
Once in possession of a remote buffer reference, a domain can transfer
data between the remote buffer and a local buffer:
void idc_send_to_remote_buffer
(
struct remote_buffer_reference remote_buffer,
struct local_buffer_reference local_buffer,
struct callback * callback, /* transfer completes asynchronously */
some kind of resource required here
);
void idc_receive_from_remote_buffer
(
struct remote_buffer_reference remote_buffer,
struct local_buffer_reference local_buffer,
struct callback * callback, /* Again, completes asynchronously */
some kind of resource required here
);
A local buffer can be unregistered and freed independently of any remote
buffer references still knocking around in remote domains (subsequent
sends/receives against it fail):
void idc_unregister_buffer
( probably a pointer to the resource passed on registration );
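Putting the bulk-transfer pieces together, a rough usage sketch (the helper names and the use of a plain pointer for the open 'resource' argument are my assumptions) could look like this: the frontend registers a local data buffer and ships the resulting reference to the backend inside a message or transaction, and the backend later pushes data into it:

/* Frontend: make a local buffer visible to other domains.  The returned
   reference travels to the backend in a message body or transaction
   parameters as part of the device-specific protocol. */
static struct remote_buffer_reference my_frontend_publish_buffer
    ( struct local_buffer_reference buffer, void * registration_resource )
{
    return idc_register_buffer( buffer, registration_resource );
}

/* Backend: push data into the frontend's buffer.  The transfer completes
   asynchronously via the callback; if the frontend has already
   unregistered the buffer, the transfer simply fails. */
static void my_backend_fill_buffer
    ( struct remote_buffer_reference destination,
      struct local_buffer_reference  source,
      struct callback              * done,
      void                         * transfer_resource )
{
    idc_send_to_remote_buffer( destination, source, done, transfer_resource );
}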
So, the 1000 statements of establishment code in the current drivers
become (a rough sketch of the sequence is given after the steps):
Receive an idc address from somewhere (resource discovery is outside the
scope of this sketch).
Allocate an IDC endpoint from somewhere (resource management is again
outside the scope of this sketch).
Call idc_endpoint_create.
Wait for 'connect' before attempting to use the connection for the
device-specific protocol implemented using messages/transactions/remote
buffer references.
Call idc_endpoint_destroy and quiesce before unloading module.
quiesce across remote nodes as well?
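A rough sketch of that establishment and teardown sequence, assuming callback functions along the lines sketched earlier and made-up helpers for the parts outside the scope of the proposal (address discovery, endpoint allocation, waiting for 'connect' and for the destroy callback):

static void my_frontend_init( struct my_frontend * fe )
{
    /* Resource discovery and endpoint allocation are outside this sketch. */
    struct idc_address address = my_discover_backend_address();
    fe->endpoint               = my_allocate_idc_endpoint();

    idc_endpoint_create( fe->endpoint, address,
                         my_frontend_appear, my_frontend_connect,
                         my_frontend_disconnect, my_frontend_disappear,
                         my_frontend_handle_message,
                         my_frontend_handle_transaction );

    /* Device-specific traffic may start only after 'connect'. */
    my_wait_for_connect( fe );
}

static void my_frontend_exit( struct my_frontend * fe )
{
    /* Completes once the endpoint has 'disconnected' and 'disappeared'
       and the endpoint resource is free for reuse. */
    idc_endpoint_destroy( fe->endpoint, &fe->destroy_callback );
    my_wait_for_callback( &fe->destroy_callback );
    /* ...then quiesce and unload the module. */
}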
The implementation of the local buffer references and memory management
can hide the use of pages which are shared between domains and
reference-counted, to provide a zero-copy implementation of bulk data
transfer and shared page caches.
I implemented something very similar to this before for a cluster
interconnect and it worked very nicely. There are some subtleties to
get right about the remote buffer reference implementation and the
implications for out-of-order and idempotent bulk data transfers.
All the above looked very sane. How does stuff get out of order,
though? We have effectively per-device queues.
As I said, it would require a few more days work to nail down a good
API.
thanks,
Nivedita
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel