# TensorPipe's thread model

TensorPipe is spawning multiple threads internally. This is a design
requirement as, for example, a single thread wouldn't manage to drive a
modern network interface card (NIC) at capacity and saturate its
bandwidth, even if it did nothing by write on the socket: multiple
threads writing in parallel to multiple sockets are the only way to
achieve that.

Moreover, the possibility of spawning new threads when needed allows
for a simpler architecture in the implementation of TensorPipe's
modular approach to backends (transports and channels): if one of these
backends needs to perform some heavy operation (a blocking syscall, an
event loop, ...) it can launch a dedicated thread for it rather than
having to schedule it on the user thread or on a shared thread pool,
thus having to "fit" the operation into some framework.

This heavy reliance on multi-threading poses of course challenges in
coordination and robustness. This document aims to outline the patterns
we've ended up adopting to have a structured and principled design
around this.

## Callbacks

TensorPipe uses callbacks to organize the control flow around
asynchronous and deferred execution. While this may be an anti-pattern
leading to so-called "spaghetti code" or "callback hell", we realized
that it was the only approach that would yield the performance we need.
Modern alternatives to callbacks (promises/futures, coroutines, ...) 
would have introduced an unacceptable overhead in some cases.

Nearly all operations in TensorPipe are non-blocking and are performed
asynchronously, in background, with their results notified through
callbacks. This includes the creation of pipes and connections (the
objects may still be performing initialization when they are given to
the user and, although operations can be performed on them, these will
be delayed until setup completes). And it also includes destruction,
which means that internal resources may not be immediately freed when a
user-facing object is deleted. The only synchronization point that
allows the user to wait for such cleanup to finish is the context's
`join` method. Some other methods that may occasionally wait are the
ones that return a value, for example the ones to retrieve addresses.

## Shared pointers

As soon as threads and callbacks enter the mix, race conditions start
to pop up. Among the first ones, there's the problem of ownership:
ideally we want a `unique_ptr`-style semantics, where each object has a
clear owner who controls its lifetime. However, when this owner asks
another thread to perform an operation on that object as part of a
callback, that callback also (temporarily) needs access to the object.
As there may be multiple operations with multiple callbacks at the same
time, transferring ownership isn't an option, and sharing it is the
only way to go. This however requires synchronization among the various
users: if the "real" user had a `unique_ptr` and gave raw pointers to
the callbacks, the real user may delete the object without the
callbacks noticing or having any way to stop/delay it. This would then
cause use-after-free errors. There must thus be a sort of "lock" that
prevents the object from being deleted while someone is working on it,
like a "semaphore" counting the users. It turns out a perfect tool for
the job is `shared_ptr`. Acquiring a lock on the object corresponds to
obtaining a `shared_ptr` instance, which increases the reference count.
The object will only be deleted when its refcount reaches zero, which
means all its users (the "real" ones and the callbacks) have stopped
using the object.

We have however solved a problem by creating an opposite one: a memory
leak. Imagine an object (say, a pipe) that is the "real" owner of
another one (say, a channel) from which it is expecting a callback, and
that callback captures a `shared_ptr` to the first object in its
closure. This is a reference cycle. It means that even if the "real"
owner of the first object relinquishes its `shared_ptr`, the objects
won't be destroyed until the callback fires (if ever). An easy solution
to this is to have callbacks only keep a `shared_ptr` when they are
running, not while they are waiting. Again, the standard library has
the perfect tool for the job: the `weak_ptr`, which will keep the
refcount unchanged but can be "locked" to obtain a real `shared_ptr`
when needed (curious coincidence that the terminology aligns with our).

So, in short: the real owner of an object keeps a `shared_ptr` to it,
it passes `weak_ptr`s to be stored in callbacks, and these are locked
back to `shared_ptr`s just before running the callbacks. (If locking
fails, the callback isn't run).

## Public objects vs private implementations

It turns out that what we said above isn't always true: in some cases
we may want a callback to keep the object alive until it has fired.
This happens because some callbacks are one half of a "contract"
regarding data ownership: throughout the API (at higher and lower
levels), `read`, `write`, `send` and `recv` methods take some data
(source or destination buffers), and by doing so the caller hands over
control of the data to the object. The way for the object to yield
ownership back to the caller is by invoking the callback. We must thus
ensure that these callbacks are always called. However, we must also
avoid calling them when we're not ready yet to give up access to the
data. For a more concrete example, consider the user trying to destroy
a pipe that has a pending write operation, while some other thread is
simultaneously performing a memory copy as part of that write
operation. If we invoke the write operation's callback before aborting
the memory copy we're giving the user the right to deallocate the
buffer, which may lead the other thread to segfault.

Here is what needs to happen: when a user deletes a pipe, all its
pending operations must be interrupted, which in turn also aborts the
lower level operations; the pipe's callbacks, however, must not be
fired and instead kept alive while waiting for the lower level
operations to wrap up, and only then they can be triggered. This shows
that a subset of the pipe, containing at least the callbacks, must
survive the destruction of the whole pipe. In other words, the lifetime
of the inner part must be detacheable from the one of the outer shell.

In order to do so, most public objects are just thin wrappers around a
single member field, which is just a pointer to an instance of a
private "implementation" (abbreviated as impl), which is where
everything happens. The impl is a `shared_ptr` so that its life cycle
can be detached and extended with respect to the one of the public
object. The callbacks that we must wait for in order to regain control
of some resource also capture a `shared_ptr`. This way we can still get
the "signal" from when the public object is deleted (and can start
terminating pending operations) but we're also able to keep the impl
around while wait for the shut down to complete.

## Locking

Objects can be accessed and worked on from many threads, from all
directions, above (user threads, higher up the stack) and below 
(low-level backend threads). To avoid race conditions on the internal
state of these object, we must have mutual exclusion between threads,
using locks. While it may be possible to have separate fine-grained
locks for different parts of some objects, in general it is safer
and easier to have one mutex per object, and use it to lock all
operations.

That's easily said, but it just as easily leads to deadlocks, which in
our experience come in two flavors:

- When an object (holding its own lock) calls a "upward" callback which
  (inline/serially) tries to perform an operation on that same object,
  which tries to acquire the same lock. This is a perfectly legitimate
  behavior, since all of our callbacks are "one-shot", that is, they
  "burn out" after they fire and thus must be immediately rearmed.

- When an object (holding its own lock) performs an operation on a
  lower level object, passing a callback to it, and this callback is
  called immediately (inline/serially) and tries to also acquire the
  lock of the first object. This typically happens when the lower level
  object is in an error state and can thus "shortcut" the operation and
  immediately trigger the callback instead of deferring it to a thread.

Mitigations for these problems are possible but none is universal and
they all have drawbacks. Examples are:

- When calling upward callbacks, extract one from the object onto the
  stack, put the object in a consistent state, release its lock and
  then call the callback. This works but there's a racing risk which
  would cause callbacks to not be called in their intended order.

- Have a dedicated thread from which to invoke callbacks. Therefore
  other threads, instead of triggering callbacks, push them to some
  queue that is consumed by this thread. This resembles the semi-future
  and executor pattern. We used to have such a pattern in place for
  calling the pipe callbacks but it was introducing an unacceptable
  latency overhead.

- The backends already typically have a thread they can defer callbacks
  to, and for the most part they already do. However having such a
  thread isn't necessarily a requirement for a transport, and such
  threads may not be running at all times (e.g., once a backend has
  been joined).

- We could replace regular locks with reentrant locks (also called
  recursive). This is typically considered bad practice, though, and
  when at some point we tried this we indeed hit problems.

The next section presents a more disciplined way of dealing with races.

## Event loops

A classic way of dealing with parallel I/O is event loops: repeatedly
polling a set of file descriptors for readability/writability (blocking
to wait for them to become ready), dealing with them, and repeating.
Syscalls to do this are `select`, `epoll`, and more. The `libuv`
library used by one of TensorPipe's transports is also based on an
event loop. Event loops are typically single-threaded, and they allow
to "simulate" parallelism by multiplexing thread if those threads would
spend most of their time doing blocking I/O.

The simplicity of event loops, their single-threaded safety and their
established effectiveness prompted us to make them a foundation of our
threading model.

If an object already has a thread to which it offloads some operations
(this is the case for most transports and some channels, but not the
pipe) then we defer all operations to it. And we really mean all of
them: all manipulation of the object (scheduling operations, querying
information, running callbacks) must be done from within that event
loop thread. All operations that are attempted on the object, either
from another thread or from within the event loop thread (for example,
by a callback in user code) are deferred, appended to a queue, and
dealt with at a later iteration of the loop. This guarantees that we'll
always have a single thread accessing such objects, thus ensuring
thread safety without even using any locks. Note that such design isn't
a requirement for transports, it's just the pattern that we've adopted
for all our current transports.

If, on the other hand, an object does not have access to a thread to
use as an event loop, we'll "borrow" the caller's thread and
temporarily use it as an event loop. We'll similarly have a queue of
tasks, and the thread will consume them one by one, until none are
left, at which point we'll stop occupying the thread and release it
back to the caller. If any new operation is attempted by another thread
while one of these temporary event loops is running, that operation is
added to the queue and thus deferred to the already-running event loop,
with the new thread immediately able to return to what it was doing.