-- Jeffrey Altman - 16 Jun 2008

The AFS File Server request throughput is limited by its current architecture, which dedicates one thread per request for the lifetime of the request. Because threads may block on disk I/O and (more importantly) on Rx RPCs (VL_*, PR_*, RXAFSCB_*), the dedicated threads are frequently idle when they could be performing real work. The AFS File Server is therefore unable to take full advantage of the CPU, disk I/O, and network I/O resources available to it.

A more effective architecture is one that is event-driven (or work-flow based). In such an architecture, the File Server would queue requests that are likely to block on I/O or Rx RPCs. The processing thread would then be free to begin processing a new incoming Rx request or continue processing an existing Rx request that has returned to the ready state.

This design is not currently possible because the Rx RPC application programming interfaces only provide for synchronous operations. The following is a proposal to add support for asynchronous requests. This proposal was developed in conjunction with Tom Keiser.

Provide an infrastructure for making Rx event-driven.

Most of Rx is geared towards a procedural paradigm. Extend Rx to provide several new primitives to allow for operations to be performed in an event-driven manner. Rx has a notion of events presently, but it is designed specifically to provide timeout-based event firing.

Generalize the existing rxevent data structure to support asynchronous events in addition to timeout events.
Provide a new API to create non-epoch event queues
Provide a new API to release events blocked on a queue
Provide a new API to block a call object on an Rx event queue
Presently, Rx worker threads dispatch Rx RPC call objects as they arrive. Generalize the worker thread dispatch interface to dispatch any arbitrary event when it becomes ready for processing.
The rx_call object will now contain a pointer to an event object. This will be used by asynchronous calls to permit rescheduling of long-lived RPCs.
When designing the new APIs, consider the work that was performed by SNA for the instrumentation framework so that future consolidation efforts require minimal effort.
New-form event objects will have two methods of being processed: first, via a callback function pointer; second, via a synchronous waiter interface.
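The two processing methods for new-form events can be sketched as follows. This is only an illustration built on POSIX threads; every name in it (`struct ev`, `ev_init`, `ev_fire`, `ev_wait`, `count_cb`) is hypothetical and not part of Rx.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical names throughout; none of these exist in Rx. */
typedef void (*ev_callback_fn)(void *rock);

struct ev {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int             fired;     /* set once by ev_fire() */
    ev_callback_fn  callback;  /* NULL when only the waiter interface is used */
    void           *rock;      /* opaque argument handed to the callback */
};

static void ev_init(struct ev *e, ev_callback_fn cb, void *rock)
{
    pthread_mutex_init(&e->lock, NULL);
    pthread_cond_init(&e->cv, NULL);
    e->fired = 0;
    e->callback = cb;
    e->rock = rock;
}

/* Mark the event ready: wake any waiters, then run the callback if set. */
static void ev_fire(struct ev *e)
{
    pthread_mutex_lock(&e->lock);
    e->fired = 1;
    pthread_cond_broadcast(&e->cv);
    pthread_mutex_unlock(&e->lock);
    if (e->callback != NULL)
        e->callback(e->rock);
}

/* Synchronous waiter interface: block until the event has fired. */
static void ev_wait(struct ev *e)
{
    pthread_mutex_lock(&e->lock);
    while (!e->fired)
        pthread_cond_wait(&e->cv, &e->lock);
    pthread_mutex_unlock(&e->lock);
}

/* Example callback: the rock points at a counter to increment. */
static void count_cb(void *rock)
{
    ++*(int *)rock;
}
```

A worker thread would call `ev_wait()` to block for work, while a caller that must not block would register a callback instead.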

Add Asynchronous RX RPC server interface.

The present design is completely procedural. When an RPC call is ready to be serviced, a server stub is called. This stub unmarshals the incoming call parameters. Once the parameters are ready, the actual function which services the RPC is called. Upon return of this function, out parameters are marshalled, and the response is sent to the caller. This mode of operation is not generic enough to deal with cases where an in-progress RPC must be suspended pending the result of a long-lived operation (such as an RPC call to the ptserver). This project phase will attempt to decouple these operations so that a server thread may be freed up while the long-lived operation is in progress.

When calls are ready to be serviced, an event object will be enqueued. This event object can be dispatched when a worker thread becomes available. Worker threads will use the synchronous waiter interface to block awaiting calls to service. This change in behavior will, in effect, unify the SQE and rxevent mechanisms.
A new "async" keyword will be added to Rxgen. When present, this will emit a special server-side RPC stub. The asynchronous server stub will not automatically marshal output upon return of the user-provided servicing routine. Instead, it will check the state of the event object bound to the Rx call object. If the event object is in the blocked state, then the subroutine will return immediately with no action. Otherwise, the output arguments will be marshalled, and a response packet sent back to the caller.
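The behaviour described for the asynchronous server stub might look roughly like the sketch below. The types and names (`struct call`, `ev_state`, `async_server_stub`, the two sample service routines) are assumptions made for illustration; actual rxgen output would differ, and marshalling is reduced to storing an integer.

```c
/* Hypothetical stand-ins for Rx types; not the real rx_call layout. */
enum ev_state { EV_READY, EV_BLOCKED };

struct call {
    enum ev_state ev_state;  /* state of the event object bound to the call */
    int reply;               /* stands in for marshalled output arguments */
    int replies_sent;        /* counts responses sent, for illustration */
};

typedef int (*service_fn)(struct call *c, int in, int *out);

/* Shape of a stub that the proposed "async" keyword might emit. */
static int async_server_stub(struct call *c, service_fn proc, int in)
{
    int out = 0;
    int code = proc(c, in, &out);

    if (c->ev_state == EV_BLOCKED)
        return 0;            /* call is parked; return with no action */

    /* Not blocked: marshal the output arguments and send the response. */
    c->reply = out;
    c->replies_sent++;
    return code;
}

/* A service routine that completes immediately. */
static int svc_immediate(struct call *c, int in, int *out)
{
    (void)c;
    *out = in * 2;
    return 0;
}

/* A service routine that suspends the call pending another operation. */
static int svc_blocks(struct call *c, int in, int *out)
{
    (void)in;
    (void)out;
    c->ev_state = EV_BLOCKED;
    return 0;
}
```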

To do:

add an async_done flag and a async_state integer to rx_call object; provide getter/setter interfaces
modify rxi_ServerProc() to check call->async_done before running after proc and ending call (otherwise assume call has been placed on a blocked queue by user)
add new rx interface to allow user to move an unblocked, partially finished call back onto the head of the sqe list
modify rxi_ReceivePacket() to initialize async_state to zero and async_done flag to 1 for new server call case (so that legacy synchronous calls continue to work without modification)
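A minimal sketch of the to-do items above, assuming a cut-down call structure. The accessor names, `call_InitServer`, and `server_proc_once` are hypothetical; only the `async_done` and `async_state` fields and their default values come from the list.

```c
/* Hypothetical subset of a call object; the real struct rx_call is larger. */
struct call {
    int async_done;    /* 1: servicing finished, safe to end the call */
    int async_state;   /* resume point owned by the service routine */
    int ended;         /* illustration only: set when the call is ended */
};

static int  call_GetAsyncDone(const struct call *c)   { return c->async_done; }
static void call_SetAsyncDone(struct call *c, int v)  { c->async_done = v; }
static int  call_GetAsyncState(const struct call *c)  { return c->async_state; }
static void call_SetAsyncState(struct call *c, int v) { c->async_state = v; }

/* New server call: these defaults keep legacy synchronous services working. */
static void call_InitServer(struct call *c)
{
    c->async_state = 0;
    c->async_done = 1;
    c->ended = 0;
}

/* One pass of the server loop: end the call only if servicing finished. */
static void server_proc_once(struct call *c, void (*service)(struct call *))
{
    service(c);
    if (!call_GetAsyncDone(c))
        return;        /* assume the user parked the call on a blocked queue */
    c->ended = 1;      /* stands in for the after proc and ending the call */
}

/* Legacy routine: never touches the async fields. */
static void legacy_service(struct call *c)
{
    (void)c;
}

/* Async routine: records a resume point and marks the call unfinished. */
static void parking_service(struct call *c)
{
    call_SetAsyncState(c, 1);
    call_SetAsyncDone(c, 0);
}
```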

Add Asynchronous RX RPC client interface.

The rxgen procedure generator would be modified to optionally produce asynchronous versions of the existing synchronous procedure calls, starting with the RXAFS_xxxx calls used for file server operations. The asynchronous calls would behave similarly to the existing calls, with the following changes:

Provide a non-blocking version of rx_NewCall() which returns an error code when all channels in a Rx connection object are in use.
After a message is XDR encoded, instead of calling rxi_EndCall() an asynchronous rxi_EndCallAsync() would be called. rxi_EndCallAsync() would send the message and either immediately return an error or would return a handle to the outstanding call.
A new rx_AsyncCallWait() function would be added to the RX library. This function would be passed an existing handle returned by rxi_EndCallAsync() and would permit the caller to check the return status and, if desired, to block on the response. On the backend, this will use the synchronous event waiter interface from Section 1.

A new rx_FetchCompletedCall() function would be added to the RX library. This function would be used to retrieve the next completed asynchronous call, and so to feed a task to an idle worker thread. It can include parameters that filter the responses by type of call, to permit specialized worker threads. A further parameter determines whether the call should block.
Instead of waking up blocked threads when a response is received from a peer as is done for the existing synchronous call model, RX will queue completed asynchronous calls for later querying by rx_FetchCompletedCall().
Implement asynchronous versions of multi_RXAFS_xxx calls.
Develop asynchronous version of pr_GetCPS() and pr_GetHostCPS().
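The completed-call queue behind the proposed rx_FetchCompletedCall() could be modelled as below. This is a sketch under assumptions: the names `done_call`, `queue_completed`, and `fetch_completed` are invented here, the call type is reduced to an integer tag, and POSIX threads stand in for whatever locking Rx would use.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model of a completed asynchronous call. */
struct done_call {
    int               type;   /* tag for the kind of RPC, used for filtering */
    int               code;   /* result of the RPC */
    struct done_call *next;
};

static pthread_mutex_t done_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cv   = PTHREAD_COND_INITIALIZER;
static struct done_call *done_head, *done_tail;

/* Called where Rx would otherwise wake a blocked synchronous caller. */
static void queue_completed(struct done_call *dc)
{
    dc->next = NULL;
    pthread_mutex_lock(&done_lock);
    if (done_tail != NULL)
        done_tail->next = dc;
    else
        done_head = dc;
    done_tail = dc;
    pthread_cond_broadcast(&done_cv);
    pthread_mutex_unlock(&done_lock);
}

/* Return the oldest completed call of the given type (type < 0 matches any).
 * If none is queued, wait when block is nonzero, otherwise return NULL. */
static struct done_call *fetch_completed(int type, int block)
{
    struct done_call *dc, *prev;

    pthread_mutex_lock(&done_lock);
    for (;;) {
        prev = NULL;
        for (dc = done_head; dc != NULL; prev = dc, dc = dc->next)
            if (type < 0 || dc->type == type)
                break;
        if (dc != NULL) {
            /* Unlink the match from the list. */
            if (prev != NULL)
                prev->next = dc->next;
            else
                done_head = dc->next;
            if (done_tail == dc)
                done_tail = prev;
            break;
        }
        if (!block)
            break;
        pthread_cond_wait(&done_cv, &done_lock);
    }
    pthread_mutex_unlock(&done_lock);
    return dc;
}
```

A daemon thread could call `fetch_completed(-1, 1)` in a loop and hand each result to a worker, while a specialized worker could pass its own type tag.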

To do:

define an async event handler object and associated getter/setters [1]
add async event handler pointer to rx_call object; provide getter/setter interfaces
non-blocking version of rx_NewCall() and associated changes to rxi_NewCall()
modify rxi_ServerProc() to run callback function for client call types
run a server thread pool, even if we're just a client (as a convenient way to service async event callback functions)
make rx_NewCall take an rx_async_event object pointer (NULL value implying synchronous call)
modify rxgen to optionally emit async client stubs -- the async start stub will XDR encode and then call rx_FlushWrite() [I think we can safely ignore blocking related to TQ_BUSY]; the async finish stub will XDR decode and EndCall.
modify call timeout event handler to call async event handler, if there is one

[1] async event:

rationale: make it future-proof, but only give one generalized option for now.

typedef enum {
    RX_ASYNC_EVENT_DO_CALLBACK,
    RX_ASYNC_EVENT_DO_NIL
} rx_async_event_action_t;

typedef enum {
    RX_ASYNC_EVENT_TYPE_DONE,
    RX_ASYNC_EVENT_TYPE_ERROR,
    RX_ASYNC_EVENT_TYPE_TIMEOUT
} rx_async_event_type_t;

typedef int rx_async_event_callback_func_t(struct rx_call *, rx_async_event_type_t, void *);

typedef struct {
    rx_async_event_callback_func_t * fp;
    void * rock;
} rx_async_event_callback_t;

typedef struct {
    rx_async_event_action_t action;
    union {
        rx_async_event_callback_t callback;
        /* simultaneously pad it on a typical cache line size, and make the
         * struct large so we could extend in future without abi breakage */
        afs_uint32 pad[15];
    } u;
} rx_async_event_t;

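#define rx_async_event_SetupNil(event) \
    ((event)->action = RX_ASYNC_EVENT_DO_NIL)

/* The last parameter is not named "rock", so that the preprocessor leaves
 * the member name u.callback.rock intact. */
#define rx_async_event_SetupCallback(event, func, rock_arg) \
    do { \
        (event)->action = RX_ASYNC_EVENT_DO_CALLBACK; \
        (event)->u.callback.fp = (func); \
        (event)->u.callback.rock = (rock_arg); \
    } while (0)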
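As an illustration of how the async event object might be used, the fragment below repeats the declarations so that it compiles on its own. Two things are assumptions rather than part of the proposal: `afs_uint32` is replaced by a local stand-in typedef, and `async_event_fire()` and `count_done()` are hypothetical helpers showing how Rx might dispatch an event. The setup macro names its last parameter `rock_arg` because a parameter called `rock` would also be substituted into the member name `u.callback.rock`.

```c
#include <stddef.h>

typedef unsigned int afs_uint32;   /* stand-in for the AFS typedef */
struct rx_call;                    /* opaque in this sketch */

typedef enum { RX_ASYNC_EVENT_DO_CALLBACK, RX_ASYNC_EVENT_DO_NIL } rx_async_event_action_t;
typedef enum {
    RX_ASYNC_EVENT_TYPE_DONE,
    RX_ASYNC_EVENT_TYPE_ERROR,
    RX_ASYNC_EVENT_TYPE_TIMEOUT
} rx_async_event_type_t;

typedef int rx_async_event_callback_func_t(struct rx_call *, rx_async_event_type_t, void *);
typedef struct { rx_async_event_callback_func_t *fp; void *rock; } rx_async_event_callback_t;
typedef struct {
    rx_async_event_action_t action;
    union { rx_async_event_callback_t callback; afs_uint32 pad[15]; } u;
} rx_async_event_t;

#define rx_async_event_SetupNil(event) \
    ((event)->action = RX_ASYNC_EVENT_DO_NIL)
#define rx_async_event_SetupCallback(event, func, rock_arg) \
    do { \
        (event)->action = RX_ASYNC_EVENT_DO_CALLBACK; \
        (event)->u.callback.fp = (func); \
        (event)->u.callback.rock = (rock_arg); \
    } while (0)

/* Hypothetical dispatcher: what Rx might do when a call changes state. */
static int async_event_fire(rx_async_event_t *ev, struct rx_call *call,
                            rx_async_event_type_t type)
{
    if (ev->action == RX_ASYNC_EVENT_DO_CALLBACK && ev->u.callback.fp != NULL)
        return ev->u.callback.fp(call, type, ev->u.callback.rock);
    return 0;   /* RX_ASYNC_EVENT_DO_NIL: nothing to do */
}

/* Example callback: count successful completions through the rock. */
static int count_done(struct rx_call *call, rx_async_event_type_t type, void *rock)
{
    (void)call;
    if (type == RX_ASYNC_EVENT_TYPE_DONE)
        ++*(int *)rock;
    return 0;
}
```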

Modify the File Server to use asynchronous RPCs.

Replace h_GetHost_r() WhoAreYou probes with asynchronous version. This requires breaking up h_GetHost_r() into pieces that are compatible with the new state machine.
Re-write MultiBreakCallBackAlternateAddress_r() to use asynchronous version of multi_RXAFSCB_CallBack()
Re-write MultiProbeAlternateAddress_r() to use asynchronous version of multi_RXAFSCB_ProbeUuid()
Re-write calls to GetCPS and GetHostCPS to use asynchronous versions.
Construct daemon thread that calls rx_FetchCompletedCall() and thread pool of worker threads to which the tasks can be handed off for completion.

To do:

rework host holds as an integer refcount
decouple ubik client structs from threads; provide a global pool of them
modify call stack from service routine through CallPreamble down to h_GetHost_r to allow returning a flag which signifies call should be blocked pending TellMeAboutYourself/WhoAreYou completion
modify call stack from service routine through CallPreamble down to hpr_GetCPS to allow returning a flag which signifies call should be blocked pending ubik_pr_GetCPS completion
modify server stubs to deal with CallPreamble returning a special "not done" error code, which results in the routine returning immediately with call->async_done set to zero
modify server stubs, CallPreamble, and relevant parts of host package to check async state value, and potentially jump to a specific line in function (probably with a switch/case type of construct)
rework MultiBreakCallBack_r to use the async client. Initiate calls as quickly as possible; the multi_error case and putconn can be handled by the callback function
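The idea of resuming a service routine from a saved state value can be sketched with a switch statement. Everything here is hypothetical: the state names, the `CALL_NOT_DONE` and `CALL_BAD_STATE` codes, and the `host_ready` and `cps_ready` flags, which stand in for completion of the host probe and the CPS lookup.

```c
/* Hypothetical resume points and return codes. */
enum { ST_START = 0, ST_HOST_KNOWN = 1, ST_CPS_KNOWN = 2 };
#define CALL_NOT_DONE  (-1)   /* assumed "not done" code */
#define CALL_BAD_STATE (-2)   /* assumed error for an unknown state */

struct call {
    int async_done;
    int async_state;
    int host_ready;   /* set when the WhoAreYou-style probe completes */
    int cps_ready;    /* set when the GetCPS-style lookup completes */
};

/* Each re-entry jumps to the saved state and continues from there. */
static int service_routine(struct call *c)
{
    switch (c->async_state) {
    case ST_START:
        if (!c->host_ready) {
            c->async_done = 0;
            return CALL_NOT_DONE;   /* blocked pending host identification */
        }
        c->async_state = ST_HOST_KNOWN;
        /* fall through */
    case ST_HOST_KNOWN:
        if (!c->cps_ready) {
            c->async_done = 0;
            return CALL_NOT_DONE;   /* blocked pending the CPS lookup */
        }
        c->async_state = ST_CPS_KNOWN;
        /* fall through */
    case ST_CPS_KNOWN:
        c->async_done = 1;          /* the real work of the RPC would go here */
        return 0;
    default:
        return CALL_BAD_STATE;
    }
}
```

The server stub would treat `CALL_NOT_DONE` as a signal to return without marshalling a reply, and the call would be rescheduled once the pending operation completes.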

Modify host/callback package locking/threading model.

The existing host/callback package design assumes that a given RPC call will be serviced on the same thread from start to finish. This assumption will no longer be true with asynchronous Rx. We must modify the design to accommodate this new dynamic.

Replace host holds bit vector with a reference count
A per-thread ubik client object is no longer a feasible solution. Instead, we will provide a pool of ubik client objects. Threads will allocate a new object from the pool whenever a call needs to be made.
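Both changes can be illustrated with a small sketch. The names (`struct client`, `client_get`, `client_put`, `struct host`, `host_hold`, `host_release`) and the fixed pool size are assumptions; a real pool would hold ubik client structures, and the host functions assume the caller already holds whatever lock protects the host.

```c
#include <pthread.h>
#include <stddef.h>

#define POOL_SIZE 4   /* arbitrary size for the sketch */

/* Hypothetical stand-in for a ubik client object. */
struct client {
    int in_use;
};

static struct client pool[POOL_SIZE];
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Borrow a free client from the pool, or NULL if all are in use. */
static struct client *client_get(void)
{
    struct client *found = NULL;
    size_t i;

    pthread_mutex_lock(&pool_lock);
    for (i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            found = &pool[i];
            break;
        }
    }
    pthread_mutex_unlock(&pool_lock);
    return found;
}

/* Return a client to the pool once the call has been made. */
static void client_put(struct client *c)
{
    pthread_mutex_lock(&pool_lock);
    c->in_use = 0;
    pthread_mutex_unlock(&pool_lock);
}

/* Host holds as a plain reference count rather than a bit vector. */
struct host {
    int refcount;
};

/* Caller is assumed to hold the lock that protects the host. */
static int host_hold(struct host *h)    { return ++h->refcount; }
static int host_release(struct host *h) { return --h->refcount; }
```

Because a reference count is not tied to a thread identity, a hold taken on one thread can be released on another, which is what asynchronous servicing requires.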

Provide a high level Rx session object.

Many parts of AFS are constrained by the lack of call parallelism provided with the Rx connection object (4 simultaneous calls). This proposal will provide a new generalized high-level connection handle (which will be called an Rx session for the purposes of this discussion). A session will be a container which contains: one Rx security object; one UUID to identify the peer; an arbitrary number of protocol addresses for the peer; and an arbitrary number of Rx connection objects.

design an Rx session data structure
Write an Rx session object allocator, deallocator, initializer, and finalizer routines
Use Rx session objects in the host/callback package to allow
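One possible shape for the session container and its four routines is sketched below. All type and function names are hypothetical stand-ins (`struct session`, `struct addr`, `struct uuid`, `session_alloc`, and so on); real code would use the Rx security object, the AFS UUID type, and Rx connection objects.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; real code would use the Rx and AFS types. */
struct sec_obj;                       /* security object, opaque here */
struct conn;                          /* Rx connection, opaque here */
struct addr { unsigned int host; unsigned short port; };
struct uuid { unsigned char b[16]; };

struct session {
    struct sec_obj *sec;      /* exactly one security object */
    struct uuid     peer;     /* identifies the peer */
    struct addr    *addrs;    /* any number of protocol addresses */
    size_t          naddrs;
    struct conn   **conns;    /* any number of Rx connection objects */
    size_t          nconns;
};

/* Allocator and deallocator. */
static struct session *session_alloc(void)
{
    return calloc(1, sizeof(struct session));
}

static void session_free(struct session *s)
{
    free(s);
}

/* Initializer: bind the security object and peer identity. */
static void session_init(struct session *s, struct sec_obj *sec,
                         const struct uuid *peer)
{
    memset(s, 0, sizeof(*s));
    s->sec = sec;
    s->peer = *peer;
}

/* Append one protocol address; returns 0 on success, -1 on allocation failure. */
static int session_add_addr(struct session *s, struct addr a)
{
    struct addr *grown = realloc(s->addrs, (s->naddrs + 1) * sizeof(*grown));

    if (grown == NULL)
        return -1;
    grown[s->naddrs++] = a;
    s->addrs = grown;
    return 0;
}

/* Finalizer: release the arrays but not the session itself. */
static void session_fini(struct session *s)
{
    free(s->addrs);
    free(s->conns);
    memset(s, 0, sizeof(*s));
}
```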

Provide asynchronous versions of rx_Read/rx_Write/rx_Readv/rx_Writev.

At present, Rx data streaming operations are synchronous. This means that data read and write RPCs will block a thread from start to finish. Over high-latency links, this could become a performance issue.

Add asynchronous support to Ubik Client:

To do:

add async event handler object pointer to ubik_client structure
rework rxgen ubik_client code to use async client stubs
create background thread(s) which handle ubik call timeouts
callback func checks return code; success can be processed inline; failures get pushed to ubik client background threads [I'm specifying background threads here in order to remove any danger of deadlock due to the fact that rx would be in the call stack twice]
background threads schedule new async calls against a different site, or against sync site depending on diagnosis of last call