AD theory and mathematical definitions

On this page we give a brief introduction to algorithmic differentiation (AD) theory. For a detailed introduction, please see the book Evaluating Derivatives by Griewank and Walther or The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation by Naumann.

With AD the derivatives of computer programs can be computed. It can be applied either to the full program or to a part of it. We therefore define a general function

\[ y = f(x) \]

that represents the computer program to be differentiated. In this definition, $ x \in \R^n $ is the input vector of size $ n $ and $ y \in \R^m $ is the output vector of size $ m $.

Forward AD Equation

The forward AD mode computes the directional derivative of $f$ in the direction $\dot x$. The equation reads

\[ \dot y = \frac{df}{dx} \dot x \]

where $\dot y$ represents the directional derivative. This equation describes what AD computes, but it does not describe how it is done (the matrix $df/dx$ is not set up explicitly).

The how can be explained by looking at the program evaluation on the CPU. During compilation and evaluation, the program is broken down into so-called elemental operations $\phi_i : \R^{n_i} \rightarrow \R $, $ w = \phi_i(u) $, where $ u $ is a vector that contains the $ n_i $ arguments of the operation. Usually these are binary operations like $+$, $*$, etc. and unary functions like $\sin$, $\exp$, etc. For all of these elemental operations the derivative is known, and the program can be seen as one big concatenation of elemental operations. Applying the chain rule and the directional derivative to this chain of elemental operations establishes the forward mode AD theory. The result is that alongside each elemental operation $\phi_i$ the forward AD equation

\[ \dot w = \frac{d\phi_i}{du} \dot u \]

needs to be evaluated.
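As a small worked example (the function is purely illustrative), consider the program $ y = \sin(x_1 x_2) $. It decomposes into the two elemental operations $ w_1 = x_1 x_2 $ and $ y = w_2 = \sin(w_1) $. Evaluating the forward AD equation alongside each of them gives

\[ \dot w_1 = x_2 \dot x_1 + x_1 \dot x_2, \quad \dot y = \dot w_2 = \cos(w_1) \dot w_1 \]

and substituting the first equation into the second reproduces exactly the directional derivative $ \dot y = \frac{df}{dx} \dot x $.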

The computation is done such that for each primal variable (e.g. $w$) a tangent variable (e.g. $\dot w$) is allocated as well. The user sets the seeding of the tangent variables before the program is run. During the program evaluation, the corresponding forward AD equation is evaluated for each elemental operation. After the program evaluation has finished, the user can obtain the directional derivative by accessing the tangent values of the outputs.
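In CoDiPack, this workflow could look as sketched below. The function func is a made-up example; the accessors setGradient, getGradient and getValue follow the CoDiPack tutorials, which remain the authoritative reference.

    #include <codi.hpp>
    #include <iostream>

    using Real = codi::RealForward;

    // Illustrative function f: R -> R.
    Real func(Real const& x) {
      return x * x * sin(x);
    }

    int main() {
      Real x = 2.0;
      x.setGradient(1.0);  // Seed the tangent variable (dot x).

      Real y = func(x);    // The forward AD equation is evaluated
                           // alongside each elemental operation.

      std::cout << "f(x)     = " << y.getValue() << std::endl;
      std::cout << "df/dx(x) = " << y.getGradient() << std::endl;  // dot y

      return 0;
    }

Note that one program evaluation yields one directional derivative; a different direction $\dot x$ requires a new evaluation with a new seeding.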

Reverse AD Equation

The reverse AD mode computes the adjoint directional derivative of $f$ in the adjoint direction $\bar y$. The equation reads

\[ \bar x = \frac{df}{dx}^T \bar y \]

where $\bar x$ represents the adjoint directional derivative. This equation describes what AD computes but, again, it does not describe how it is done (the transposed matrix $\frac{df}{dx}^T$ is not set up explicitly).

The how can be explained by the identity $ \scalar{\dot y}{\bar y} = \scalar{\frac{df}{dx} \dot x}{\bar y} = \scalar{\dot x}{\frac{df}{dx}^T \bar y} = \scalar{\dot x}{\bar x}$. It states that the reverse AD mode is just the discrete adjoint of the forward AD mode. How this discrete adjoint looks is derived in the books mentioned above. The result is that for each elemental operation a slightly modified version of the reverse AD equation

\[ \bar u \aeq \frac{d\phi_i}{du}^T \bar w; \quad \bar w = 0 \]

needs to be evaluated. First the adjoint variables of the inputs are updated, afterwards the adjoint value of the output is reset to zero. This equation can no longer be evaluated alongside the primal computation. It has to be evaluated for every elemental operation in reverse order, that is from $k$ to $1$, where $k \in \N$ is the index of the last elemental operation of the primal program.
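For the illustrative program $ y = \sin(x_1 x_2) $ from above, with $ w_1 = x_1 x_2 $ and $ w_2 = \sin(w_1) $, the reverse sweep evaluates

\[ \bar w_1 \aeq \cos(w_1) \bar w_2, \quad \bar x_1 \aeq x_2 \bar w_1, \quad \bar x_2 \aeq x_1 \bar w_1 \]

in this order, resetting each $\bar w$ to zero after its use. Starting from the seeding $ \bar w_2 = \bar y $, this yields exactly $ \bar x = \frac{df}{dx}^T \bar y $.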

Since the elemental operations need to be evaluated in reverse order, the information necessary for this reversal needs to be stored for each operation. Therefore, the computation is done such that the user has to declare (register) all input variables. Based on this declaration, the AD tool records the necessary information. After the computation has finished, the user can set the seeding of $\bar y$ and start the reverse interpretation of the recorded data. This populates $\bar x$ with the adjoint sensitivity information.
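A minimal sketch of this record-and-interpret workflow with CoDiPack's reverse mode type is given below. Again, func is a made-up example, and the tape handling follows the pattern of the CoDiPack tutorials.

    #include <codi.hpp>
    #include <iostream>

    using Real = codi::RealReverse;

    // Illustrative function f: R -> R.
    Real func(Real const& x) {
      return x * x * sin(x);
    }

    int main() {
      Real::Tape& tape = Real::getTape();

      tape.setActive();        // Start recording elemental operations.

      Real x = 2.0;
      tape.registerInput(x);   // Declare (register) the input.

      Real y = func(x);

      tape.registerOutput(y);  // Declare (register) the output.
      tape.setPassive();       // Stop recording.

      y.setGradient(1.0);      // Seed the adjoint direction (bar y).
      tape.evaluate();         // Reverse interpretation of the tape.

      std::cout << "df/dx(x) = " << x.getGradient() << std::endl;  // bar x

      tape.reset();            // Clear the recorded data.
      return 0;
    }

One reverse interpretation yields the sensitivities of a single seeded output with respect to all registered inputs.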

Mathematical naming conventions

The Jacobian of $ f $ is defined by:

\[ J = \frac{\d f}{\d x} \in \R^{m \times n} \]

The number of rows ( $ m $) represents the number of output variables and the number of columns ( $ n $) represents the number of input variables. The derivative of the i-th output with respect to the j-th input is represented by $ J_{i,j} $.
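In these conventions, the forward mode computes one column of $ J $ per evaluation (seeding $\dot x$ with a unit vector), while the reverse mode computes one row per reverse interpretation (seeding $\bar y$ with a unit vector). Below is a sketch of accumulating the full Jacobian with the forward mode, using a made-up function func with $ n = m = 2 $.

    #include <codi.hpp>
    #include <iostream>

    using Real = codi::RealForward;

    // Illustrative function f: R^2 -> R^2.
    void func(Real const* x, Real* y) {
      y[0] = x[0] * x[1];
      y[1] = sin(x[0]) + x[1];
    }

    int main() {
      double J[2][2];  // J[i][j] = d y_i / d x_j

      for (int j = 0; j < 2; ++j) {  // One sweep per input direction.
        Real x[2] = {2.0, 3.0};
        Real y[2];

        x[j].setGradient(1.0);  // Seed the j-th unit direction.
        func(x, y);

        for (int i = 0; i < 2; ++i) {
          J[i][j] = y[i].getGradient();  // Column j of the Jacobian.
        }
      }

      std::cout << "J = [[" << J[0][0] << ", " << J[0][1] << "], ["
                << J[1][0] << ", " << J[1][1] << "]]" << std::endl;
      return 0;
    }

For $ m \ll n $ the reverse mode is preferable, since it needs only $ m $ reverse interpretations instead of $ n $ forward evaluations.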

The Hessian of $ f $ is defined by:

\[ H = \frac{\d^2 f}{\d^2 x} \in \R^{m \times n \times n} \]

The first dimension ( $ m $) represents the number of output variables; the second and third dimensions ( $ n $) represent the first and second derivatives with respect to the input variables. The second derivative of the i-th output with respect to the j-th and k-th inputs is represented by $ H_{i,j,k} $.
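As a small illustration of this index convention, consider again the example $ f(x) = \sin(x_1 x_2) $ with $ m = 1 $ and $ n = 2 $. Then

\[ H_{1,1,1} = -x_2^2 \sin(x_1 x_2), \quad H_{1,1,2} = H_{1,2,1} = \cos(x_1 x_2) - x_1 x_2 \sin(x_1 x_2), \]

where the symmetry $ H_{i,j,k} = H_{i,k,j} $ holds because $ f $ is twice continuously differentiable.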