Chapter 3 Optimization
3.1 Line search
3.1.1 Methodology
The goal of Exercise 3.1 in Nocedal and Wright (2006) is to minimize the bivariate Rosenbrock function (Equation (3.1)) using steepest descent and Newton’s method. The Rosenbrock function, also known as Rosenbrock’s banana function, has a long, narrow, parabolic-shaped flat valley and is often used to test the performance of optimization algorithms.
\[ \begin{equation} f(\mathbf{x})=100(x_2-x_1^2)^2+(1-x_1)^2 \tag{3.1} \end{equation} \] We can implement Equation (3.1) in R as follows:
# Rosenbrock:
f = function(X) {
  100 * (X[2] - X[1]^2)^2 + (1 - X[1])^2
}
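As a quick check, the function evaluates to zero at the known global minimizer \((1,1)\):
f(c(1, 1)) # returns 0, the global minimum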
Figure 3.1 shows the output of the function over \(x_1,x_2 \in [-1.5, 1.5]\), along with its minimizer, indicated by a red asterisk, and the two starting points: (1) \(X_0=(1.2,1.2)\) and (2) \(X_0=(-1.2,1)\).

Figure 3.1: Output of the Rosenbrock function and minimizer in red.
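A minimal base-R sketch of a comparable contour plot is given below; the grid resolution and plotting style are my own assumptions, not the original plotting code.
x1 = x2 = seq(-1.5, 1.5, length.out = 200)
z = outer(x1, x2, function(a, b) 100 * (b - a^2)^2 + (1 - a)^2)
contour(x1, x2, z, nlevels = 30)
points(1, 1, pch = 8, col = "red")        # minimizer
points(c(1.2, -1.2), c(1.2, 1), pch = 19) # the two starting points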
The gradient and Hessian of \(f\) can be computed as
\[ \begin{equation} \nabla f(\mathbf{x})= \begin{pmatrix} \frac{ \partial f}{\partial x_1} \\ \frac{ \partial f}{\partial x_2} \end{pmatrix} \tag{3.2} \end{equation} \]
and
\[ \begin{equation} \nabla^2 f(\mathbf{x})= \begin{pmatrix} \frac{ \partial^2 f}{\partial x_1^2} & \frac{ \partial^2 f}{\partial x_1 \partial x_2} \\ \frac{ \partial^2 f}{\partial x_2\partial x_1} & \frac{ \partial^2 f}{\partial x_2^2} \end{pmatrix} \tag{3.3} \end{equation} \]
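For the Rosenbrock function in Equation (3.1), these partial derivatives work out to
\[ \nabla f(\mathbf{x})= \begin{pmatrix} -400x_1(x_2-x_1^2)-2(1-x_1) \\ 200(x_2-x_1^2) \end{pmatrix} \quad \text{and} \quad \nabla^2 f(\mathbf{x})= \begin{pmatrix} 1200x_1^2-400x_2+2 & -400x_1 \\ -400x_1 & 200 \end{pmatrix}, \]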
which in R can be encoded as follows:
# Gradient:
df = function(X) {
  df = rep(0, length(X))
  df[1] = -400 * X[1] * (X[2] - X[1]^2) - 2 * (1 - X[1]) # partial derivative w.r.t. x_1
  df[2] = 200 * (X[2] - X[1]^2)                          # partial derivative w.r.t. x_2
  return(df)
}
# Hessian:
ddf = function(X) {
  ddf = matrix(nrow = length(X), ncol = length(X))
  ddf[1,1] = 1200 * X[1]^2 - 400 * X[2] + 2 # second partial derivative w.r.t. x_1
  ddf[2,1] = ddf[1,2] = -400 * X[1]         # mixed partial derivative
  ddf[2,2] = 200                            # second partial derivative w.r.t. x_2
  return(ddf)
}
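As an optional sanity check, not part of the original exercise, the analytic derivatives can be compared against numerical approximations, assuming the numDeriv package is available:
library(numDeriv)
X_test = c(-1.2, 1) # arbitrary test point
max(abs(df(X_test) - grad(f, X_test)))     # should be near zero
max(abs(ddf(X_test) - hessian(f, X_test))) # should be near zero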
For both methods I will use the Armijo condition with backtracking. The gradient_desc function, shown below, implements both steepest descent and Newton’s method. Its main arguments are the objective f, its gradient df, the Hessian ddf (required for Newton’s method), the starting point X0, the initial step length step_size0, the Armijo constant c, the convergence tolerance tau on the gradient, the backtracking rule backtrack_cond, and the iteration limit max_iter; setting remember = TRUE additionally stores every visited point.
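To make the two methods precise: at iterate \(x_k\), both compute the search direction \(p_k\) by solving \(B_k p_k = -\nabla f(x_k)\), with \(B_k = I\) for steepest descent and \(B_k = \nabla^2 f(x_k)\) for Newton’s method. A trial step length \(\alpha\) is accepted once the Armijo sufficient decrease condition
\[ f(x_k + \alpha p_k) \le f(x_k) + c\,\alpha\,\nabla f(x_k)^\top p_k \]
holds; the backtracking loop in the code simply halves \(\alpha\) until it does.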
function (f, df, X0, step_size0 = 1, ddf = NULL, method = "newton", c = 1e-04, remember = TRUE, tau = 1e-05, backtrack_cond = "arminjo", max_iter = 10000)
{
    X_latest = matrix(X0)
    if (remember) {
        X = matrix(X0, ncol = length(X0))
        steps = matrix(ncol = length(X0))
    }
    iter = 0
    if (method == "steepest") {
        B = function(X) {
            diag(length(X))
        }
    }
    else if (method == "newton") {
        B = tryCatch(ddf, error = function(e) {
            stop("Hessian needs to be supplied for Newton's method.")
        })
    }
    if (backtrack_cond == "arminjo") {
        sufficient_decrease = function(alpha) {
            return(f(X_k + alpha * p_k) <= f(X_k) + c * alpha * t(df_k) %*% p_k)
        }
    }
    else if (is.na(backtrack_cond)) {
        sufficient_decrease = function(alpha) {
            return(f(X_k + alpha * p_k) <= f(X_k))
        }
    }
    while (any(abs(df(X_latest) - rep(0, length(X_latest))) > tau) & iter < max_iter) {
        X_k = X_latest
        alpha = step_size0
        df_k = matrix(df(X_latest))
        B_k = B(X_latest)
        p_k = qr.solve(-B_k, df_k)
        while (!sufficient_decrease(alpha)) {
            alpha = alpha/2
        }
        X_latest = X_latest + alpha * p_k
        iter = iter + 1
        if (remember) {
            X = rbind(X, t(X_latest))
            steps = rbind(steps, t(alpha * p_k))
        }
    }
    if (iter >= max_iter)
        warning("Reached maximum number of iterations without convergence.")
    output = list(optimal = X_latest, visited = tryCatch(X, error = function(e) NULL), steps = tryCatch(steps, error = function(e) NULL), X0 = X0, method = method)
    return(output)
}
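As a minimal illustration of the interface (a hypothetical one-off call, separate from the exercise script below), a single Newton run from the second starting point could look like this:
newton_run = gradient_desc(f = f, df = df, ddf = ddf, X0 = c(-1.2, 1), method = "newton")
newton_run$optimal # final iterate; should be close to the minimizer (1, 1)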
Similarly, the next code chunk shows how gradient_desc is applied to the problem at hand.
library(fromScratchR)
library(data.table)
init_guesses = 1:nrow(X0)
algos = c("steepest", "newton")
grid = expand.grid(guess = init_guesses, algo = algos)
X_star = lapply(
  1:nrow(grid),
  function(i) {
    gradient_desc(
      f = f, df = df,
      ddf = ddf, # Hessian, needed when method = "newton"
      X0 = X0[grid[i, "guess"], c(x1, x2)],
      method = grid[i, "algo"]
    )
  }
)
# Tidy up
X_star_dt = rbindlist(
  lapply(
    1:length(X_star),
    function(i) {
      dt = data.table(X_star[[i]]$visited)
      dt[, method := X_star[[i]]$method]
      dt[, (c("x0_1", "x0_2")) := .(X_star[[i]]$X0[1], X_star[[i]]$X0[2])]
      dt[, iteration := .I - 1]
      dt[, y := f(c(V1, V2)), by = .(1:nrow(dt))]
      dt
    }
  )
)
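To summarise the runs, one could, for example, pull out the last visited point of each run from X_star_dt (a small data.table idiom, not part of the original chunk):
X_star_dt[, .SD[.N], by = .(method, x0_1, x0_2)] # final iterate per run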
3.1.2 Results
Figures 3.2 and 3.3 below show how the two algorithms converge to the minimizer from the good and the poor initial guess, respectively.

Figure 3.2: Good initial guess.

Figure 3.3: Poor initial guess.