Overview
A SkyframeStateMachine is a deconstructed function-object that resides on
the heap. It supports flexible and evaluation without redundancy1 when
required values are not immediately available but computed asynchronously. The
StateMachine cannot tie up a thread resource while waiting, but instead has to
be suspended and resumed. The deconstruction thus exposes explicit re-entry
points so that prior computations can be skipped.
StateMachines can be used to express sequences, branching, structured logical
concurrency and are tailored specifically for Skyframe interaction.
StateMachines can be composed into larger StateMachines and share
sub-StateMachines. Concurrency is always hierarchical by construction and
purely logical. Every concurrent subtask runs in the single shared parent
SkyFunction thread.
Introduction
This section briefly motivates and introducesStateMachines, found in the
java.com.google.devtools.build.skyframe.state
package.
A brief introduction to Skyframe restarts
Skyframe is a framework that performs parallel evaluation of dependency graphs. Each node in the graph corresponds with the evaluation of a SkyFunction with a SkyKey specifying its parameters and SkyValue specifying its result. The computational model is such that a SkyFunction may lookup SkyValues by SkyKey, triggering recursive, parallel evaluation of additional SkyFunctions. Instead of blocking, which would tie up a thread, when a requested SkyValue is not yet ready because some subgraph of computation is incomplete, the requesting SkyFunction observes anull getValue response and should return null
instead of a SkyValue, signaling that it is incomplete due to missing inputs.
Skyframe restarts the SkyFunctions when all previously requested SkyValues
become available.
Before the introduction of SkyKeyComputeState, the traditional way of handling
a restart was to fully rerun the computation. Although this has quadratic
complexity, functions written this way eventually complete because each rerun,
fewer lookups return null. With SkyKeyComputeState it is possible to
associate hand-specified check-point data with a SkyFunction, saving significant
recomputation.
StateMachines are objects that live inside SkyKeyComputeState and eliminate
virtually all recomputation when a SkyFunction restarts (assuming that
SkyKeyComputeState does not fall out of cache) by exposing suspend and resume
execution hooks.
Stateful computations inside SkyKeyComputeState
From an object-oriented design standpoint, it makes sense to consider storing
computational objects inside SkyKeyComputeState instead of pure data values.
In Java, the bare minimum description of a behavior carrying object is a
functional interface and it turns out to be sufficient. A StateMachine has
the following, curiously recursive, definition2.
Tasks interface is analogous to SkyFunction.Environment but it is
designed for asynchrony and adds support for logically concurrent subtasks3.
The return value of step is another StateMachine, allowing the specification
of a sequence of steps, inductively. step returns DONE when the
StateMachine is done. For example:
StateMachine with the following output.
this::step2 is also a StateMachine due to
step2 satisfying StateMachine’s functional interface definition. Method
references are the most common way to specify the next state in a
StateMachine.
StateMachine steps, instead of a
monolithic function, provides the hooks needed to suspend and resume a
computation. When StateMachine.step returns, there is an explicit suspension
point. The continuation specified by the returned StateMachine value is an
explicit resume point. Recomputation can thus be avoided because the
computation can be picked up exactly where it left off.
Callbacks, continuations and asynchronous computation
In technical terms, aStateMachine serves as a continuation, determining the
subsequent computation to be executed. Instead of blocking, a StateMachine can
voluntarily suspend by returning from the step function, which transfers
control back to a Driver instance. The Driver can
then switch to a ready StateMachine or relinquish control back to Skyframe.
Traditionally, callbacks and continuations are conflated into one concept.
However, StateMachines maintain a distinction between the two.
- Callback - describes where to store the result of an asynchronous computation.
- Continuation - specifies the next execution state.
StateMachine return values of StateMachines and
encapsulate the complex execution that follows once all asynchronous
computations resolve. This structured approach helps to keep the complexity of
callbacks manageable.
Tasks
TheTasks interface provides StateMachines with an API to lookup SkyValues
by SkyKey and to schedule concurrent subtasks.
Tasks interface to perform lookups or create
subtasks, those lookups and subtasks will complete before the next state begins.
Tip: (Corollary) If subtasks are complex StateMachines or recursively create
subtasks, they all transitively complete before the next state begins.
SkyValue lookups
StateMachines use Tasks.lookUp overloads to look up SkyValues. They are
analogous to SkyFunction.Environment.getValue and
SkyFunction.Environment.getValueOrThrow and have similar exception handling
semantics. The implementation does not immediately perform the lookup, but
instead, batches4 as many lookups as possible before doing so. The value
might not be immediately available, for example, requiring a Skyframe restart,
so the caller specifies what to do with the resulting value using a callback.
The StateMachine processor (Drivers and bridging to
SkyFrame) guarantees that the value is available before
the next state begins. An example follows.
new Key(), passing
this as the consumer. That is possible because DoesLookup implements
Consumer<SkyValue>.
Tip: When passing this as a value sink, it’s helpful to readers to upcast it
to the receiver type to narrow down the purpose of passing this. The example
passes (Consumer<SkyValue>) this.
By contract, before the next state DoesLookup.processValue begins, all the
lookups of DoesLookup.step are complete. Therefore value is available when
it is accessed in processValue.
Subtasks
Tasks.enqueue requests the execution of logically concurrent subtasks.
Subtasks are also StateMachines and can do anything regular StateMachines
can do, including recursively creating more subtasks or looking up SkyValues.
Much like lookUp, the state machine driver ensures that all subtasks are
complete before proceeding to the next step. An example follows.
Subtask1 and Subtask2 are logically concurrent, everything runs in a
single thread so the “concurrent” update of i does not need any
synchronization.
Structured concurrency
Since everylookUp and enqueue must resolve before advancing to the next
state, it means that concurrency is naturally limited to tree-structures. It’s
possible to create hierarchical5 concurrency as shown in the following
example.
Composition and control flow patterns
This section presents examples for how multipleStateMachines can be composed
and solutions to certain control flow problems.
Sequential states
This is the most common and straightforward control flow pattern. An example of this is shown in Stateful computations insideSkyKeyComputeState.
Branching
Branching states inStateMachines can be achieved by returning different
values using regular Java control flow, as shown in the following example.
DONE, for early completion.
Advanced sequential composition
Since theStateMachine control structure is memoryless, sharing StateMachine
definitions as subtasks can sometimes be awkward. Let M1 and
M2 be StateMachine instances that share a StateMachine, S,
with M1 and M2 being the sequences <A, S, B> and
<X, S, Y> respectively. The problem is that S doesn’t know whether to
continue to B or Y after it completes and StateMachines don’t quite keep a
call stack. This section reviews some techniques for achieving this.
StateMachine as terminal sequence element
This doesn’t solve the initial problem posed. It only demonstrates sequential
composition when the shared StateMachine is terminal in the sequence.
Subtask for sequential composition
Since enqueued subtasks are guaranteed to complete before the next state, it’s sometimes possible to slightly abuse6 the subtask mechanism.runAfter injection
Sometimes, abusing Tasks.enqueue is impossible because there are other
parallel subtasks or Tasks.lookUp calls that must be completed before S
executes. In this case, injecting a runAfter parameter into S can be used to
inform S of what to do next.
StateMachines with runAfter, is
the road to Callback Hell. It’s better to break up sequential
runAfters with ordinary sequential states instead.
DONE as the runAfter parameter when there’s
nothing to run afterwards.
Tip: When using runAfter, always annotate the parameter with /* runAfter= */
to let the reader know the meaning at the callsite.
Forbidden alternative: runAfterUnlessError
In an earlier draft, we had considered a runAfterUnlessError that would abort
early on errors. This was motivated by the fact that errors often end up getting
checked twice, once by the StateMachine that has a runAfter reference and
once by the runAfter machine itself.
After some deliberation, we decided that uniformity of the code is more
important than deduplicating the error checking. It would be confusing if the
runAfter mechanism did not work in a consistent manner with the
tasks.enqueue mechanism, which always requires error checking.
Warning: When using runAfter, the machine that has the injected runAfter
should invoke it unconditionally at completion, even on error, for consistency.
Direct delegation
Each time there is a formal state transition, the mainDriver loop advances.
As per contract, advancing states means that all previously enqueued SkyValue
lookups and subtasks resolve before the next state executes. Sometimes the logic
of a delegate StateMachine makes a phase advance unnecessary or
counterproductive. For example, if the first step of the delegate performs
SkyKey lookups that could be parallelized with lookups of the delegating state
then a phase advance would make them sequential. It could make more sense to
perform direct delegation, as shown in the example below.
Data flow
The focus of the previous discussion has been on managing control flow. This section describes the propagation of data values.Implementing Tasks.lookUp callbacks
There’s an example of implementing a Tasks.lookUp callback in SkyValue
lookups. This section provides rationale and suggests
approaches for handling multiple SkyValues.
Tasks.lookUp callbacks
The Tasks.lookUp method takes a callback, sink, as a parameter.
myValue being a member variable of the StateMachine instance doing the
lookup. However, the lambda requires an extra memory allocation compared to
implementing the Consumer<SkyValue> interface in the StateMachine
implementation. The lambda is still useful when there are multiple lookups that
would be ambiguous.
Note: Bikeshed warning. There is a noticeable difference of approximately 1%
end-to-end CPU usage when implementing callbacks systematically in
StateMachine implementations compared to using lambdas, which makes this
recommendation debatable. To avoid unnecessary debates, it is advised to leave
the decision up to the individual implementing the solution.
There are also error handling overloads of Tasks.lookUp, that are analogous to
SkyFunction.Environment.getValueOrThrow.
StateMachine class directly
implement the callback saves a memory allocation for the lamba.
Error handling provides a bit more detail, but essentially,
there’s not much difference between the propagation of errors and normal values.
Consuming multiple SkyValues
Multiple SkyValue lookups are often required. An approach that works much of the time is to switch on the type of SkyValue. The following is an example that has been simplified from prototype production code.Consumer<SkyValue> callback implementation can be shared unambiguously
because the value types are different. When that’s not the case, falling back to
lambda-based implementations or full inner-class instances that implement the
appropriate callbacks is viable.
Propagating values between StateMachines
So far, this document has only explained how to arrange work in a subtask, but
subtasks also need to report a values back to the caller. Since subtasks are
logically asynchronous, their results are communicated back to the caller using
a callback. To make this work, the subtask defines a sink interface that is
injected via its constructor.
accept(Bar value) rather than the stuttery void acceptBarValue(Bar value) above.
However, Consumer<SkyValue> is a common overload of void accept(Bar value),
so doing this often leads to violations of the Overloads: never
split
style-guide rule.
Tip: Using a custom ResultSink type instead of a generic one from
java.util.function makes it easy to find implementations in the code base,
improving readability.
A caller StateMachine would then look like the following.
Caller has to propagate its
results back and defines its own Caller.ResultSink. Caller implements the
BarProducer.ResultSink callbacks. Upon resumption, processResult checks if
value is null to determine if an error occurred. This is a common behavior
pattern after accepting output from either a subtask or SkyValue lookup.
Note that the implementation of acceptBarError eagerly forwards the result to
the Caller.ResultSink, as required by Error bubbling.
Alternatives for top-level StateMachines are described in Drivers and
bridging to SkyFunctions.
Error handling
There’s a couple of examples of error handling already inTasks.lookUp
callbacks and Propagating values between
StateMachines. Exceptions, other than
InterruptedException are not thrown, but instead passed around through
callbacks as values. Such callbacks often have exclusive-or semantics, with
exactly one of a value or error being passed.
The next section describes a a subtle, but important interaction with Skyframe
error handling.
Error bubbling (—nokeep_going)
Warning: Errors need to be eagerly propagated all the way back to the SkyFunction for error bubbling to function correctly. During error bubbling, a SkyFunction may be restarted even if not all requested SkyValues are available. In such cases, the subsequent state will never be reached due to theTasks API contract. However, the StateMachine should
still propagate the exception.
Since propagation must occur regardless of whether the next state is reached,
the error handling callback must perform this task. For an inner StateMachine,
this is achieved by invoking the parent callback.
At the top-level StateMachine, which interfaces with the SkyFunction, this can
be done by calling the setException method of ValueOrExceptionProducer.
ValueOrExceptionProducer.tryProduceValue will then throw the exception, even
if there are missing SkyValues.
If a Driver is being utilized directly, it is essential to check for
propagated errors from the SkyFunction, even if the machine has not finished
processing.
Event Handling
For SkyFunctions that need to emit events, aStoredEventHandler is injected
into SkyKeyComputeState and further injected into StateMachines that require
them. Historically, the StoredEventHandler was needed due to Skyframe dropping
certain events unless they are replayed but this was subsequently fixed.
StoredEventHandler injection is preserved because it simplifies the
implementation of events emitted from error handling callbacks.
Drivers and bridging to SkyFunctions
A Driver is responsible for managing the execution of StateMachines,
beginning with a specified root StateMachine. As StateMachines can
recursively enqueue subtask StateMachines, a single Driver can manage
numerous subtasks. These subtasks create a tree structure, a result of
Structured concurrency. The Driver batches SkyValue
lookups across subtasks for improved efficiency.
There are a number of classes built around the Driver, with the following API.
Driver takes a single root StateMachine as a parameter. Calling
Driver.drive executes the StateMachine as far as it can go without a
Skyframe restart. It returns true when the StateMachine completes and false
otherwise, indicating that not all values were available.
Driver maintains the concurrent state of the StateMachine and it is well
suited for embedding in SkyKeyComputeState.
Directly instantiating Driver
StateMachine implementations conventionally communicate their results via
callbacks. It’s possible to directly instantiate a Driver as shown in the
following example.
The Driver is embedded in the SkyKeyComputeState implementation along with
an implementation of the corresponding ResultSink to be defined a bit further
down. At the top level, the State object is an appropriate receiver for the
result of the computation as it is guaranteed to outlive Driver.
ResultProducer.
Embedding Driver
If the StateMachine produces a value and raises no exceptions, embedding
Driver is another possible implementation, as shown in the following example.
State is
the function specific type of SkyKeyComputeState).
Driver in the StateMachine implementation is a better fit for
Skyframe’s synchronous coding style.
StateMachines that may produce exceptions
Otherwise, there areSkyKeyComputeState-embeddable ValueOrExceptionProducer
and ValueOrException2Producer classes that have synchronous APIs to match
synchronous SkyFunction code.
The ValueOrExceptionProducer abstract class includes the following methods.
Driver instance and closely resembles the
ResultProducer class in Embedding driver and interfaces
with the SkyFunction in a similar manner. Instead of defining a ResultSink,
implementations call setValue or setException when either of those occur.
When both occur, the exception takes priority. The tryProduceValue method
bridges the asynchronous callback code to synchronous code and throws an
exception when one is set.
As previously noted, during error bubbling, it’s possible for an error to occur
even if the machine is not yet done because not all inputs are available. To
accommodate this, tryProduceValue throws any set exceptions, even before the
machine is done.
Epilogue: Eventually removing callbacks
StateMachines are a highly efficient, but boilerplate intensive way to perform
asynchronous computation. Continuations (particularly in the form of Runnables
passed to ListenableFuture) are widespread in certain parts of Bazel code,
but aren’t prevalent in analysis SkyFunctions. Analysis is mostly CPU bound and
there are no efficient asynchronous APIs for disk I/O. Eventually, it would be
good to optimize away callbacks as they have a learning curve and impede
readability.
One of the most promising alternatives is Java virtual threads. Instead of
having to write callbacks, everything is replaced with synchronous, blocking
calls. This is possible because tying up a virtual thread resource, unlike a
platform thread, is supposed to be cheap. However, even with virtual threads,
replacing simple synchronous operations with thread creation and synchronization
primitives is too expensive. We performed a migration from StateMachines to
Java virtual threads and they were orders of magnitude slower, leading to
almost a 3x increase in end-to-end analysis latency. Since virtual threads are
still a preview feature, it’s possible that this migration can be performed at a
later date when performance improves.
Another approach to consider is waiting for Loom coroutines, if they ever
become available. The advantage here is that it might be possible to reduce
synchronization overhead by using cooperative multitasking.
If all else fails, low-level bytecode rewriting could also be a viable
alternative. With enough optimization, it might be possible to achieve
performance that approaches hand-written callback code.
Appendix
Callback Hell
Callback hell is an infamous problem in asynchronous code that uses callbacks. It stems from the fact that the continuation for a subsequent step is nested within the previous step. If there are many steps, this nesting can be extremely deep. If coupled with control flow the code becomes unmanageable.runAfter injection
pattern is used too densely, but this can be avoided by interspersing injections
with sequential steps.
Example: Chained SkyValue lookups
It is often the case that the application logic requires dependent chains of SkyValue lookups, for example, if a second SkyKey depends on the first SkyValue. Thinking about this naively, this would result in a complex, deeply nested callback structure.step2 follows step1. Note that here, a
lambda is used to assign value2. This makes the ordering of the code match the
ordering of the computation from top-to-bottom.
Miscellaneous Tips
Readability: Execution Ordering
To improve readability, strive to keep theStateMachine.step implementations
in execution order and callback implementations immediately following where they
are passed in the code. This isn’t always possible where the control flow
branches. Additional comments might be helpful in such cases.
In Example: Chained SkyValue lookups, an
intermediate method reference is created to achieve this. This trades a small
amount of performance for readability, which is likely worthwhile here.
Generational Hypothesis
Medium-lived Java objects break the generational hypothesis of the Java garbage collector, which is designed to handle objects that live for a very short time or objects that live forever. By definition, objects inSkyKeyComputeState violate this hypothesis. Such objects, containing the
constructed tree of all still-running StateMachines, rooted at Driver have
an intermediate lifespan as they suspend, waiting for asynchronous computations
to complete.
It seems less bad in JDK19, but when using StateMachines, it’s sometimes
possible to observe an increase in GC time, even with dramatic decreases in
actual garbage generated. Since StateMachines have an intermediate lifespan
they could be promoted to old gen, causing it to fill up more quickly, thus
necessitating more expensive major or full GCs to clean up.
The initial precaution is to minimize the use of StateMachine variables, but
it is not always feasible, for example, if a value is needed across multiple
states. Where it is possible, local stack step variables are young generation
variables and efficiently GC’d.
For StateMachine variables, breaking things down into subtasks and following
the recommended pattern for Propagating values between
StateMachines is also helpful. Observe that when
following the pattern, only child StateMachines have references to parent
StateMachines and not vice versa. This means that as children complete and
update the parents using result callbacks, the children naturally fall out of
scope and become eligible for GC.
Finally, in some cases, a StateMachine variable is needed in earlier states
but not in later states. It can be beneficial to null out references of large
objects once it is known that they are no longer needed.
Naming states
When naming a method, it’s usually possible to name a method for the behavior that happens within that method. It’s less clear how to do this inStateMachines because there is no stack. For example, suppose method foo
calls a sub-method bar. In a StateMachine, this could be translated into the
state sequence foo, followed by bar. foo no longer includes the behavior
bar. As a result, method names for states tend to be narrower in scope,
potentially reflecting local behavior.
Concurrency tree diagram
The following is an alternative view of the diagram in Structured concurrency that better depicts the tree structure. The blocks form a small tree.Footnotes
- In contrast to Skyframe’s convention of restarting from the beginning when values are not available. ↩
-
Note that
stepis permitted to throwInterruptedException, but the examples omit this. There are a few low methods in Bazel code that throw this exception and it propagates up to theDriver, to be described later, that runs theStateMachine. It’s fine to not declare it to be thrown when unneeded. ↩ -
Concurrent subtasks were motivated by the
ConfiguredTargetFunctionwhich performs independent work for each dependency. Instead of manipulating complex data structures that process all the dependencies at once, introducing inefficiencies, each dependency has its own independentStateMachine. ↩ -
Multiple
tasks.lookUpcalls within a single step are batched together. Additional batching can be created by lookups occurring within concurrent subtasks. ↩ - This is conceptually similar to Java’s structured concurrency jeps/428. ↩
- Doing this is similar to spawning a thread and joining it to achieve sequential composition. ↩