In this example we'll compare the optimization of a relatively simple problem: Logistic Regression.
Consider a binary classification dataset $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^p$ and $y_i \in \{0, 1\}$.
The logistic regression problem defines a model $h_\beta(x) = \sigma(\beta_1 + \beta_2 x_1 + \dots + \beta_{p+1} x_p)$, where $\sigma(t) = \dfrac{1}{1 + e^{-t}}$. The value of $\beta$ is found by maximizing the log-likelihood function, that is, by minimizing

$$f(\beta) = -\frac{1}{n} \sum_{i=1}^n \Big[ y_i \ln h_\beta(x_i) + (1 - y_i) \ln\big(1 - h_\beta(x_i)\big) \Big].$$

We'll input the gradient of this function manually. It is given by

$$\nabla f(\beta) = \frac{1}{n} \begin{bmatrix} e^T \\ X^T \end{bmatrix} \big(h_\beta(X) - y\big),$$

where $e$ is the vector with all components equal to 1, $X$ is the matrix whose rows are $x_1^T, \dots, x_n^T$, and $h_\beta(X)$ is applied row by row.
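If you want to double-check this formula, here's a quick sketch of the derivation. Using $\sigma'(t) = \sigma(t)\big(1 - \sigma(t)\big)$ and the chain rule, the gradient of a single term of the sum is

$$\nabla_\beta \Big[ y_i \ln h_\beta(x_i) + (1 - y_i) \ln\big(1 - h_\beta(x_i)\big) \Big] = \big(y_i - h_\beta(x_i)\big) \begin{bmatrix} 1 \\ x_i \end{bmatrix}.$$

Summing over $i$, dividing by $n$, and flipping the sign (because $f$ is the negative average) gives the expression above.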
Download data.csv to follow the example.
In Python, we can define the function to minimize, and its derivatives, as follows:
import numpy as np
import pandas as pd
from scipy.optimize import minimize
df = pd.read_csv('data.csv')
X = df.iloc[:,0:-1].values
y = df.iloc[:,-1].values
n, p = X.shape
def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def myfun(beta):
    # Negative average log-likelihood; the 1e-8 avoids log(0)
    hbeta = sigmoid(beta[0] + X.dot(beta[1:]))
    return -np.sum(
        y * np.log(hbeta + 1e-8) + (1 - y) * np.log(1 - hbeta + 1e-8)
    ) / n

def myjac(beta):
    # Gradient of myfun: intercept component stacked on top of X' * (h - y), divided by n
    hbeta = sigmoid(beta[0] + X.dot(beta[1:]))
    dif = hbeta - y
    return np.concatenate((np.array([np.sum(dif)]), X.T.dot(dif))) / n
beta0 = np.zeros(p + 1)
res = minimize(myfun, beta0, jac=myjac, method='l-bfgs-b', tol=1e-4)
beta = res.x
hbeta = sigmoid(beta[0] + X.dot(beta[1:]))
ypred = np.round(hbeta)
accuracy = sum(y == ypred) / n
print('accuracy = ', accuracy)
If you're using Jupyter, I also ask you to run
%%timeit
res = minimize(myfun, beta0, jac=myjac, method='l-bfgs-b', tol=1e-4)
so that we can compare the execution time against the Julia version later.
Here's a similar Julia version:
using CSV, DataFrames, Optim, LinearAlgebra
path = joinpath("assets", "data.csv")
df = DataFrame(CSV.File(path))
X = Matrix(df[:,1:end-1])
y = df[:,end]
n, p = size(X)
sigmoid(t) = 1 / (1 + exp(-t))
# Negative average log-likelihood, written as a loop over the samples
function myfun(β, X, y)
    @views hβ = sigmoid.(β[1] .+ X * β[2:end])
    out = sum(
        yᵢ * log(ŷᵢ + 1e-8) + (1 - yᵢ) * log(1 - ŷᵢ + 1e-8)
        for (yᵢ, ŷᵢ) in zip(y, hβ)
    )
    return -out / length(y)
end

# In-place gradient of myfun, storing the result in `out`
function myjac(out, β, X, y)
    n = length(y)
    @views δ = (sigmoid.(β[1] .+ X * β[2:end]) - y) / n
    out[1] = sum(δ)
    out[2:end] .= X' * δ
    return out
end

output = optimize(
    β -> myfun(β, X, y),
    (out, β) -> myjac(out, β, X, y),
    zeros(p + 1),
    LBFGS(),
    Optim.Options(
        g_tol = 1e-4,
    ),
)
β = Optim.minimizer(output)
hβ = sigmoid.(β[1] .+ X * β[2:end])
ŷ = round.(hβ)
accuracy = sum(y .== ŷ) / n
0.986
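Since we wrote the gradient by hand, it's worth a quick sanity check before moving on. Here's a small sketch comparing `myjac` against a central finite-difference approximation of `myfun`; the test point `βtest` and the step `h` are arbitrary choices for illustration:

```julia
# Sketch of a finite-difference check of the hand-written gradient.
βtest = 0.01 * randn(p + 1)              # small point, to keep the sigmoid away from 0 and 1
g = myjac(zeros(p + 1), βtest, X, y)     # analytic gradient
h = 1e-6
gfd = map(1:p+1) do j
    e = zeros(p + 1); e[j] = 1.0
    (myfun(βtest + h * e, X, y) - myfun(βtest - h * e, X, y)) / (2h)
end
maximum(abs, g - gfd)                    # should be close to zero
```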
The beginning of both codes is similar: read the file, create `X` and `y`.
We then define the functions. Some main differences:

- Instead of writing the objective vectorized, I use a loop over the indices.
- To avoid allocating a new vector for `β[2:end]`, I use `@views` at the beginning of the line. The macro automatically turns `β[2:end]` into a view of `β` instead of a copy (see the small sketch after this list).
- The Julia `myjac` function receives an additional argument (`out`) for the memory that it should reuse for the output.
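To see what `@views` buys us, here's a small sketch of the kind of micro-benchmark you could run; `f_copy` and `f_view` are names I made up for illustration:

```julia
using BenchmarkTools

f_copy(β, X) = X * β[2:end]          # β[2:end] allocates a new vector before the product
f_view(β, X) = @views X * β[2:end]   # indexes through a view, no copy of the slice

βtmp = randn(p + 1)
@btime f_copy($βtmp, $X);   # allocates the slice and the result
@btime f_view($βtmp, $X);   # allocates only the result of the product
```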
Now, the solver-specific part. In Python, we use SciPy's `minimize`, and in Julia we're using Optim.jl's `optimize`. The arguments are similar, although Optim's arguments are positional.
We can benchmark Optim's code to compare against SciPy's `%%timeit` result:
using BenchmarkTools
β₀ = zeros(p + 1)
@benchmark optimize(
    β -> myfun(β, X, y),
    (out, β) -> myjac(out, β, X, y),
    β₀,
    LBFGS(),
    Optim.Options(
        g_tol = 1e-4,
    ),
)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
Range (min … max): 410.716 ms … 453.499 ms ┊ GC (min … max): 0.00% … 2.98%
Time (median): 427.842 ms ┊ GC (median): 0.00%
Time (mean ± σ): 426.335 ms ± 12.937 ms ┊ GC (mean ± σ): 0.45% ± 1.02%
█▁ ▁ ▁ ▁ ▁ █ ▁ ▁ ▁
██▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁█▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
411 ms Histogram: frequency by time 453 ms <
Memory estimate: 30.37 MiB, allocs estimate: 1431.
In Julia, we don't have a single optimization library. In fact, I'm a co-creator of the JuliaSmoothOptimizers organization, which "competes" with Optim. We focus on tools for nonlinear optimization developers instead of focusing on users, so our basic interface is a lot more raw. On the other hand, we claim that our methods are faster, though keep my bias in mind.
NLPModels alone doesn't have a simple user interface, so we create a struct `MyLogisticRegression` to hold the problem and to define its objective and gradient. The struct fields are not obvious, and you have to follow the NLPModels docs to learn how to set them up. But you can check the speed gain.
using JSOSolvers, Logging, NLPModels

struct MyLogisticRegression <: AbstractNLPModel{Float64, Vector{Float64}}
    meta :: NLPModelMeta{Float64, Vector{Float64}}
    counters :: Counters
    X
    y

    function MyLogisticRegression(X, y)
        # p + 1 variables, with the zero vector as the starting point
        meta = NLPModelMeta(p + 1, x0=zeros(p + 1))
        return new(meta, Counters(), X, y)
    end
end

# Hook our functions into the NLPModels API
NLPModels.obj(nlp :: MyLogisticRegression, x) = myfun(x, nlp.X, nlp.y)
NLPModels.grad!(nlp :: MyLogisticRegression, x, gx) = myjac(gx, x, nlp.X, nlp.y)

nlp = MyLogisticRegression(X, y)
output = lbfgs(nlp, atol=1e-4, rtol=1e-4)
println(output)

@benchmark with_logger(NullLogger()) do
    lbfgs(nlp, atol=1e-4, rtol=1e-4)
end
Generic Execution stats
status: first-order stationary
objective value: 0.035504781916813
primal feasibility: 0.0
dual feasibility: 0.00011923338501061753
solution: [-0.008672859839379604 -0.7833754246517539 -0.09849721612105959 -0.2561961749800854 ⋯ -0.04372711407863323]
iterations: 30
elapsed time: 1.0523371696472168
BenchmarkTools.Trial: 25 samples with 1 evaluation.
Range (min … max): 193.638 ms … 216.649 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 200.720 ms ┊ GC (median): 0.00%
Time (mean ± σ): 202.070 ms ± 6.841 ms ┊ GC (mean ± σ): 0.39% ± 1.29%
█ ▃
▇█▁▇▇▁▁█▁▁▁▇▁▇▁▁▇▇▇▁▇▁▁▇▁▁▁▁▁▇▇▇▁▇▇▁▁▁▁▁▇▁▇▁▁▁▁▁▁▁▁▇▁▁▁▇▁▁▁▁▇ ▁
194 ms Histogram: frequency by time 217 ms <
Memory estimate: 14.42 MiB, allocs estimate: 918.
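As a quick aside, the `Counters` field is what NLPModels uses to track evaluations, so we can also check how much work the solver did. A small sketch using NLPModels' counter functions:

```julia
# Reset the evaluation counters and solve once more to read them.
NLPModels.reset!(nlp)
lbfgs(nlp, atol=1e-4, rtol=1e-4)
neval_obj(nlp), neval_grad(nlp)   # number of objective and gradient evaluations
```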
One noteworthy aspect of this comparison is that Python uses the L-BFGS-B Fortran algorithm, which is a reference implementation of a classic method. The Julia versions are pure Julia.