R: Mixed models: convergence


Faster convergence

Providing starting values for lme with REML:

If a model will be used repeatedly with small changes to the fixed-effects model but the same random-effects model, convergence can be quickened by providing starting values through a pdMat object, as follows. Start from a fitted model:

 > fit <- lme( distance ~ age*Sex, Orthodont, random = ~ 1 + age | Subject)

1. Get the final parameter values by examining the reStruct object:

> unclass(fit$modelStruct$reStruct[[1]])
[1]  0.60769199 -2.27798198 -0.09190528
attr(,"formula")
~1 + age
attr(,"Dimnames")
attr(,"Dimnames")[[1]]
[1] "(Intercept)" "age"
attr(,"Dimnames")[[2]]
[1] "(Intercept)" "age"

2. Create a pdMat object from these values:

 > pd <- pdLogChol( c(.60769199, -2.27798198, -0.09190528), form = ~ 1 + age)

3. Fit with 'pd' as the starting value:

 > fit2 <- lme( distance ~ age*Sex, Orthodont, random = list( Subject = pd))
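
The same recipe can be written without copying the coefficients by hand. A minimal sketch, assuming nlme is loaded and 'fit' was created as above; the change to the fixed-effects formula is an arbitrary illustration:

 > theta <- as.vector( unclass( fit$modelStruct$reStruct[[1]]))  # step 1, programmatically
 > pd <- pdLogChol( theta, form = ~ 1 + age)                     # step 2
 > fit3 <- lme( distance ~ age + Sex, Orthodont,                 # step 3, new fixed model
 +              random = list( Subject = pd))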

  • From a contribution by Doug Bates explaining how the log-Cholesky parametrization used in nlme can lead to non-convergence when a variance component is estimated as zero:
>> Can Doug or anyone comment on whether the development work on
>> lme4:::nlmer has included any steps in this direction or not?
>
> Yes.
>
> The algorithm in nlme alternates between solving a linear
> mixed-effects problem to update estimates of the variance components
> and solving a penalized nonlinear least squares problem to update
> estimates of the fixed-effects parameters and our approximation to the
> conditional distribution of the random effects. This type of
> algorithm that alternates between two conditional optimizations is
> appealing because each of the sub-problems is much simpler than the
> general problem. However it may have poor convergence properties. In
> particular it may end up bouncing back and forth between two different
> conditional optima.
>
> Also, at the time we wrote nlme we tried to remove the constraints on
> the variance components by transforming them away (In simple
> situations we iterate on the logarithm of the relative variances of
> the random effects.) This works well except when the estimate of the
> variance component is zero. Trying to reach zero when iterating on
> the logarithm scale can lead to very flat likelihood surfaces.
>
> In the nlmer function I use the same parameterization of the
> variance-covariance of the random effects as in lmer and use the
> Laplace approximation to the log-likelihood. Both of these changes
> should provide more reliable convergence, although the nlmer code has
> not been vetted to nearly the same extent as has the nlme code. In
> other words, I am confident that the algorithm is superior but the
> implementation may still need some work.
>
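
For comparison, the corresponding random-intercept-and-slope model in lme4, which uses the parameterization Bates describes above. A sketch, assuming lme4 is installed (the Orthodont data come from nlme):

 > library(lme4)
 > fit4 <- lmer( distance ~ age*Sex + (1 + age | Subject), data = Orthodont)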

lmeControl

   lmeControl(maxIter, msMaxIter, tolerance, niterEM, msMaxEval, msTol,
          msScale, msVerbose, returnObject, gradHess, apVar,
          .relStep, minAbsParApVar, nlmStepMax,
          opt = c("nlminb", "optim"), optimMethod,
          natural)

Arguments

maxIter maximum number of iterations for the lme optimization algorithm. Default is 50.

msMaxIter maximum number of iterations for the nlm optimization step inside the lme optimization. Default is 50.

tolerance tolerance for the convergence criterion in the lme algorithm. Default is 1e-6.

niterEM number of iterations for the EM algorithm used to refine the initial estimates of the random effects variance-covariance coefficients. Default is 25.

msMaxEval maximum number of evaluations of the objective function permitted for nlminb. Default is 200.

msTol tolerance for the convergence criterion in nlm, passed as the rel.tolerance argument to the function (see documentation on nlm). Default is 1e-7.

msScale scale function passed as the scale argument to the nlm function (see documentation on that function). Default is lmeScale.

msVerbose a logical value passed as the trace argument to nlm (see documentation on that function). Default is FALSE.

returnObject a logical value indicating whether the fitted object should be returned when the maximum number of iterations is reached without convergence of the algorithm. Default is FALSE.

gradHess a logical value indicating whether numerical gradient vectors and Hessian matrices of the log-likelihood function should be used in the nlm optimization. This option is only available when the correlation structure (corStruct) and the variance function structure (varFunc) have no "varying" parameters and the pdMat classes used in the random effects structure are pdSymm (general positive-definite), pdDiag (diagonal), pdIdent (multiple of the identity), or pdCompSymm (compound symmetry). Default is TRUE.

apVar a logical value indicating whether the approximate covariance matrix of the variance-covariance parameters should be calculated. Default is TRUE.

.relStep relative step for numerical derivatives calculations. Default is .Machine$double.eps^(1/3).

nlmStepMax stepmax value to be passed to nlm. See nlm for details. Default is 100.0.

opt the optimizer to be used, either "nlminb" (the default since R 2.2.0) or "optim" (the previous default).

optimMethod character - the optimization method to be used with the optim optimizer. The default is "BFGS". An alternative is "L-BFGS-B".

minAbsParApVar numeric value - minimum absolute parameter value in the approximate variance calculation. The default is 0.05.

natural a logical value indicating whether the pdNatural parametrization should be used for general positive-definite matrices (pdSymm) in reStruct, when the approximate covariance matrix of the estimators is calculated. Default is TRUE.

Value

a list with components for each of the possible arguments.
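
As an illustration of these settings, a control object that raises the iteration limits and switches the optimizer, as one might try when lme stops with a convergence error. The particular values are arbitrary, a sketch rather than recommended defaults:

 > ctrl <- lmeControl( maxIter = 200, msMaxIter = 200, niterEM = 50,
 +                     opt = "optim", optimMethod = "L-BFGS-B",
 +                     returnObject = TRUE)
 > fit5 <- lme( distance ~ age*Sex, Orthodont,
 +              random = ~ 1 + age | Subject, control = ctrl)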

nlm

Usage

nlm(f, p, ..., hessian = FALSE, typsize = rep(1, length(p)),
   fscale = 1, print.level = 0, ndigit = 12, gradtol = 1e-6,
   stepmax = max(1000 * sqrt(sum((p/typsize)^2)), 1000),
   steptol = 1e-6, iterlim = 100, check.analyticals = TRUE)

Arguments

f the function to be minimized. If the function value has an attribute called gradient or both gradient and hessian attributes, these will be used in the calculation of updated parameter values. Otherwise, numerical derivatives are used. deriv returns a function with suitable gradient attribute. This should be a function of a vector of the length of p followed by any other arguments specified by the ... argument.

p starting parameter values for the minimization.

... additional arguments to f.

hessian if TRUE, the hessian of f at the minimum is returned.

typsize an estimate of the size of each parameter at the minimum.

fscale an estimate of the size of f at the minimum.

print.level this argument determines the level of printing which is done during the minimization process. The default value of 0 means that no printing occurs, a value of 1 means that initial and final details are printed and a value of 2 means that full tracing information is printed.

ndigit the number of significant digits in the function f.

gradtol a positive scalar giving the tolerance at which the scaled gradient is considered close enough to zero to terminate the algorithm. The scaled gradient is a measure of the relative change in f in each direction p[i] divided by the relative change in p[i].

stepmax a positive scalar which gives the maximum allowable scaled step length. stepmax is used to prevent steps which would cause the optimization function to overflow, to prevent the algorithm from leaving the area of interest in parameter space, or to detect divergence in the algorithm. stepmax would be chosen small enough to prevent the first two of these occurrences, but should be larger than any anticipated reasonable step.

steptol a positive scalar providing the minimum allowable relative step length.

iterlim a positive integer specifying the maximum number of iterations to be performed before the program is terminated.

check.analyticals a logical scalar specifying whether the analytic gradients and Hessians, if they are supplied, should be checked against numerical derivatives at the initial parameter values. This can help detect incorrectly formulated gradients or Hessians.

Details

Note that arguments after ... must be matched exactly.

If a gradient or hessian is supplied but evaluates to the wrong mode or length, it will be ignored, with a warning, if check.analyticals = TRUE (the default). The hessian is not even checked unless the gradient is present and passes the sanity checks.

From the three methods available in the original source, we always use method “1” which is line search.

The functions supplied must always return finite (including not NA and not NaN) values.

Value

A list containing the following components:

minimum the value of the estimated minimum of f.

estimate the point at which the minimum value of f is obtained.

gradient the gradient at the estimated minimum of f.

hessian the hessian at the estimated minimum of f (if requested).

code an integer indicating why the optimization process terminated.

1: relative gradient is close to zero, current iterate is probably solution.

2: successive iterates within tolerance, current iterate is probably solution.

3: last global step failed to locate a point lower than estimate. Either estimate is an approximate local minimum of the function or steptol is too small.

4: iteration limit exceeded.

5: maximum step size stepmax exceeded five consecutive times. Either the function is unbounded below, becomes asymptotic to a finite value from above in some direction, or stepmax is too small.

iterations the number of iterations performed.
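
To make the return components concrete, a minimal direct call to nlm on a simple quadratic, with an analytic gradient supplied through the gradient attribute; the target c(1, 2) is arbitrary:

 > f <- function(x) {
 +     val <- sum((x - c(1, 2))^2)                  # quadratic with minimum at c(1, 2)
 +     attr(val, "gradient") <- 2 * (x - c(1, 2))   # analytic gradient
 +     val
 + }
 > res <- nlm(f, p = c(10, 10), hessian = TRUE)
 > res$code      # 1: relative gradient close to zero
 > res$estimate  # approximately 1 and 2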