function [X, M, C, Xerr, B, peff, kavlr, kmisr, iptrn] = regem(X, options)
%REGEM Imputation of missing values with regularized EM algorithm.
%
% [X, M, C, Xerr] = REGEM(X, OPTIONS) replaces missing values
% (NaNs) in the data matrix X with imputed values. REGEM
% returns
%
% X: data matrix with imputed values substituted for NaNs,
% M: estimated mean of X,
% C: estimated covariance matrix of X,
% Xerr: estimated mean error of the imputed values.
%
% Missing values are imputed with a regularized expectation
% maximization (EM) algorithm. In an iteration of the EM algorithm,
% given estimates of the mean and of the covariance matrix are
% revised in three steps. First, for each record X(i,:) (row of X, i=1:n)
% with missing values, the regression parameters of the variables with
% missing values on the variables with available values are
% computed from the estimates of the mean and of the covariance
% matrix. Second, the missing values in a record X(i,:) are filled
% in with their conditional expectation values given the available
% values and the estimates of the mean and of the covariance
% matrix, the conditional expectation values being the product of
% the available values and the estimated regression
% coefficients. Third, the mean and the covariance matrix are
% re-estimated, the mean as the sample mean of the completed
% dataset and the covariance matrix as the sum of the sample
% covariance matrix of the completed dataset and an estimate of the
% conditional covariance matrix of the imputation error. As the
% regression models need to be computed only for each unique pattern
% of missing values in records (rows of X), the iterations go over
% unique missingness patterns. (This is considerably faster than
% iterating over each row of X, e.g., when missingness has a staircase
% pattern with few steps.)
%
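% A minimal usage sketch (illustrative; assumes a data matrix X0 with
% NaNs marking the missing values; the names opts, X0, Xhat are not
% part of this function):
%
%   opts = struct;
%   opts.regress = 'mridge';   % default regression method
%   [Xhat, M, C, Xerr] = regem(X0, opts);
%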
% In the regularized EM algorithm, the parameters of the regression
% models are estimated by a regularized regression method. By
% default, the parameters of the regression models are estimated by
% a multiple ridge regression for each record, with one regularization
% parameter (ridge parameter) per record. Optionally, the parameters
% of the regression models can be estimated by an individual ridge
% regression for each missing value, with one regularization parameter
% per missing value. The regularization parameters for the ridge
% regressions are selected as the minimizers of the generalized
% cross-validation (GCV) function, unless they are fixed and a priori
% specified. As another option, the parameters of the regression models
% can be estimated by truncated total least squares (TTLS). The truncation
% parameter, a discrete regularization parameter, can be either fixed
% and given or can be chosen in each iteration by one of a variety of
% selection criteria, from simple (and fast) information-theoretic
% criteria to more reliable (but slower) K-fold cross-validation.
%
% As default initial condition for the imputation algorithm, the
% mean of the data is computed from the available values, mean
% values are filled in for missing values, and a covariance matrix
% is estimated as the sample covariance matrix of the completed
% dataset with mean values substituted for missing
% values. Optionally, initial estimates for the missing values and
% for the covariance matrix estimate can be given as input
% arguments.
%
% To obtain good initial values for regularized EM iterations, the
% regularized EM algorithm may be used with TTLS and a fixed (relatively
% small) truncation parameter. This is very fast, as it requires only
% one eigendecomposition per iteration (instead of one eigendecomposition
% per missingness pattern and iteration with ridge regression,
% or K decompositions with K-fold cross validation).
%
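% For example (a sketch; the option values are illustrative, not
% recommendations):
%
%   opts1 = struct('regress', 'ttls', 'regpar', 3); % fixed small truncation
%   [X1, M1, C1] = regem(X0, opts1);                % fast first pass
%   opts2 = struct('regress', 'iridge');
%   opts2.Xmis0 = X1;   % warm-start missing-value guesses
%   opts2.C0    = C1;   % warm-start covariance estimate
%   [Xhat, M, C, Xerr] = regem(X0, opts2);          % refined second pass
%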
% [X, M, C, Xerr, B, peff, kavlr, kmisr, iptrn] = REGEM(X, OPTIONS)
% additionally returns the cell arrays B, peff, kavlr, kmisr (length equal to
% number of unique missingness patterns) and the vector iptrn (length n),
% which contain
%
% B{j}: Matrix of regression coefficients of missing values on
% available values for each missingness pattern. (This is for
% data centered with the estimate of the mean at the beginning
% of the last iteration, which differs from the mean vector
% that is output by the update after the last iteration.)
% peff{j}: Effective number of degrees of freedom for each
% missingness pattern (and each missing value in pattern)
% kavlr{j}: Indices of available values for each missingness pattern
% kmisr{j}: Indices of missing values for each missingness pattern
% iptrn(i): Index j of missingness pattern for row i of X
%
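% For instance, to inspect what was imputed in some row i (sketch;
% the row index i is arbitrary):
%
%   j  = iptrn(i);      % missingness pattern of row i
%   km = kmisr{j};      % columns that were imputed in row i
%   se = Xerr(i, km);   % estimated errors of those imputed values
%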
% The OPTIONS structure specifies parameters in the algorithm:
%
% Field name Parameter Default
%
% OPTIONS.regress Regression procedure to be used: 'mridge'
% 'mridge': multiple ridge regression
% 'iridge': individual ridge regressions
% 'ttls': truncated total least squares
% regression
%
% OPTIONS.stagtol Stagnation tolerance: quit when 1e-2
% consecutive iterates of the missing
% values are so close that
% norm( Xmis(it)-Xmis(it-1) )
% <= stagtol * norm( Xmis(it-1) )
%
% OPTIONS.maxit Maximum number of EM iterations. 30
%
% OPTIONS.inflation Inflation factor for the residual 1
% covariance matrix. Because of the
% regularization, the residual covariance
% matrix underestimates the conditional
% covariance matrix of the imputation
% error. The inflation factor is to correct
% this underestimation. The update of the
% covariance matrix estimate is computed
% with residual covariance matrices
% inflated by the factor OPTIONS.inflation,
% and the estimates of the imputation error
% are inflated by the same factor.
%
% OPTIONS.truncslct Truncation parameter selection for TTLS. 'KCV'
% -- Global methods: 'MDL', 'AIC',
% 'AICC', and 'NE08' choose a global
% truncation parameter in each iteration
% according to an information criterion for
% truncating principal component analyses
% (for details, see PCA_TRUNCATION_CRITERIA).
% -- Local method: 'KCV' chooses a
% truncation adaptively for each record
% by K-fold cross-validation.
%
% OPTIONS.Kcv Parameter K for K-fold cross-validation 5
% (only used if OPTIONS.truncslct='KCV')
%
% OPTIONS.cv_norm Norm of the error to be minimized in 2
% K-fold cross-validation (only used if
% OPTIONS.truncslct='KCV'). If KCV is
% chosen, the mean imputation error Xerr
% is also given in this norm.
%
% OPTIONS.ncv Only rows 1:ncv are sequentially left n
% out in cross validation (if
% OPTIONS.truncslct='KCV'). Restricting
% ncv to less than the sample size n may
% be helpful if missing values only or
% primarily occur in rows ncv+1:n of X.
%
% OPTIONS.regpar Regularization parameter. not set
% -- For ridge regression, set regpar to
% sqrt(eps) for mild regularization; leave
% regpar unset for GCV selection of
% regularization parameters.
% -- For TTLS regression, set regpar to a
% number for a fixed truncation parameter.
% Leave regpar unset to choose the
% truncation parameter adaptively by
% the method specified in OPTIONS.truncslct.
% Or specify a vector regpar of possible
% truncation parameters, from which an
% optimal parameter is then chosen
% adaptively by OPTIONS.truncslct.
%
% OPTIONS.scalefac Vector of scale factors with which not set
% variables are to be rescaled prior to
% regression analysis and regularization.
% (Variables will be divided by the scale
% factors). If OPTIONS.scalefac is not set,
% the estimated standard deviations will
% be used as scale factors (so correlation
% matrices will be regularized).
%
% OPTIONS.relvar_res Minimum relative variance of residuals. 5e-2
% From the parameter OPTIONS.relvar_res, a
% lower bound for the regularization
% parameter is constructed if GCV is used,
% to prevent GCV from erroneously choosing
% too small a regularization parameter.
%
% OPTIONS.minvarfrac Minimum fraction of total variation in 0
% standardized variables that must be
% retained in the regularization.
% From the parameter minvarfrac,
% an approximate upper bound for the
% regularization parameter is constructed
% if GCV is used. The default value
% OPTIONS.minvarfrac = 0 corresponds to
% no upper bound for the regularization
% parameter.
%
% OPTIONS.Xmis0 Initial imputed values. Xmis0 is a not set
% (possibly sparse) matrix of the same
% size as X with initial guesses in place
% of the NaNs in X.
%
% OPTIONS.C0 Initial estimate of covariance matrix. not set
% If no initial covariance matrix C0 is
% given but initial estimates Xmis0 of the
% missing values are given, the sample
% covariance matrix of the dataset
% completed with initial imputed values is
% taken as an initial estimate of the
% covariance matrix.
%
% OPTIONS.Xcmp Display the weighted rms difference not set
% between the imputed values and the
% values given in Xcmp, a matrix of the
% same size as X but without missing
% values. By default, REGEM displays
% the rms difference between the imputed
% values at consecutive iterations. The
% option of displaying the difference
% between the imputed values and reference
% values exists for testing purposes.
%
% OPTIONS.neigs Number of eigenvalue-eigenvector pairs not set
% to be computed for TTLS regression.
% By default, all nonzero eigenvalues and
% corresponding eigenvectors are computed.
% By computing fewer (neigs) eigenvectors,
% the computations can be accelerated, but
% the residual covariance matrices become
% inaccurate. The residual covariance
% matrices underestimate the imputation
% error conditional covariance matrices
% more and more as neigs is decreased.
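%
% A sketch of assembling an OPTIONS structure for TTLS with K-fold
% cross-validation (field values here are illustrative, not
% recommendations):
%
%   opts = struct;
%   opts.regress   = 'ttls';
%   opts.truncslct = 'KCV';   % select truncation by K-fold CV
%   opts.Kcv       = 5;       % number of CV folds
%   opts.regpar    = 1:20;    % trial truncation parameters to search
%   [Xhat, M, C, Xerr] = regem(X0, opts);
%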
% References:
% [1] T. Schneider, 2001: Analysis of incomplete climate data:
% Estimation of mean values and covariance matrices and
% imputation of missing values. Journal of Climate, 14,
% 853--871.
% [2] R. J. A. Little and D. B. Rubin, 1987: Statistical
% Analysis with Missing Data. Wiley Series in Probability
% and Mathematical Statistics. (For EM algorithm.)
% [3] P. C. Hansen, 1997: Rank-Deficient and Discrete Ill-Posed
% Problems: Numerical Aspects of Linear Inversion. SIAM
% Monographs on Mathematical Modeling and Computation.
% (For regularization techniques, including the selection of
% regularization parameters.)
narginchk(1, 2) % check number of input arguments
if ndims(X) > 2, error('X must be vector or 2-D array.'); end
% if X is a vector, make sure it is a column vector (representing
% a single variable)
if length(X)==numel(X)
X = X(:);
end
[n, p] = size(X);
% algorithm exploits covariance among variables; it will not work
% with just one variable
if p < 2, error('X must contain at least 2 variables.'); end
% number of degrees of freedom for estimation of covariance matrix
dofC = n - 1; % use degrees of freedom correction
% ============== process options =======================
if nargin ==1 || isempty(options)
fopts = [];
else
fopts = fieldnames(options);
end
% initialize options structure for regression modules
optreg = struct;
% default values
regpar_given = false;
truncslct = 'none';
trial_trunc_given = false;
if ~isempty(strmatch('regress', fopts))
regress = lower(options.regress);
switch regress
case {'mridge', 'iridge'}
if strmatch('regpar', fopts)
regpar_given = true;
if ~isscalar(options.regpar)
error('Regularization parameter must be a number')
else
optreg.regpar = options.regpar;
regress = 'mridge';
end
end
case {'ttls'}
if strmatch('truncslct', fopts)
truncslct = lower(options.truncslct);
else % default: K-fold CV
truncslct = 'kcv';
end
if ~isempty(strmatch('regpar', fopts)) && isscalar(options.regpar)
% fixed truncation parameter given
truncslct = 'none';
regpar_given = true;
trunc = min([options.regpar, n-1, p]);
elseif ~isempty(strmatch('regpar', fopts)) && isvector(options.regpar)
% vector of trial truncations given
trial_trunc = sort(options.regpar);
trial_trunc_given = true;
else
% default: try all possible truncations
trial_trunc = [];
trial_trunc_given = false;
end
% number of cross-validation folds
if isempty(strmatch('Kcv', fopts))
Kcv = 5;
else
Kcv = options.Kcv;
end
% norm of error to be minimized by K-fold cross-validation
if isempty(strmatch('cv_norm', fopts))
cv_norm = 2;
else
cv_norm = options.cv_norm;
end
% maximum index to be included in cross-validation
if isempty(strmatch('ncv', fopts))
ncv = n; % default: cross-validation over entire sample
else
ncv = options.ncv;
end
if strmatch('neigs', fopts)
neigs = options.neigs;
else
neigs = min(n-1, p);
end
otherwise
error(['Unknown regression method ', regress])
end
else
regress = 'mridge';
end
if strmatch('stagtol', fopts)
stagtol = options.stagtol;
else
stagtol = 1e-2;
end
if strmatch('maxit', fopts)
maxit = options.maxit;
else
maxit = 30;
end
if strmatch('inflation', fopts)
inflation = options.inflation;
else
inflation = 1;
end
if strmatch('scalefac', fopts)
D_given = true;
D = options.scalefac(:);
if length(D) ~= p
error('OPTIONS.scalefac must be vector of length p.')
end
else
D_given = false;
end
if strmatch('relvar_res', fopts)
optreg.relvar_res = options.relvar_res;
else
optreg.relvar_res = 5e-2;
end
if strmatch('minvarfrac', fopts)
optreg.minvarfrac = options.minvarfrac;
else
optreg.minvarfrac = 0;
end
if strmatch('Xmis0', fopts)
Xmis0_given = true;
Xmis0 = options.Xmis0;
if any(size(Xmis0) ~= [n,p])
error('OPTIONS.Xmis0 must have the same size as X.')
end
else
Xmis0_given = false;
end
if strmatch('C0', fopts)
C0_given = true;
C0 = options.C0;
if any(size(C0) ~= [p, p])
error('OPTIONS.C0 has size incompatible with X.')
end
else
C0_given = false;
end
if strmatch('Xcmp', fopts)
Xcmp_given = true;
Xcmp = options.Xcmp;
if any(size(Xcmp) ~= [n,p])
error('OPTIONS.Xcmp must have the same size as X.')
end
sXcmp = std(Xcmp);
else
Xcmp_given = false;
end
% ================= end options ========================
% get indices of missing values and initialize matrix of imputed values
indmis = find(isnan(X));
nmis = length(indmis);
if nmis == 0
warning('No missing value flags found.')
return % no missing values
end
[~,kmis] = ind2sub([n, p], indmis);
% use full matrices: needs more memory but is faster
Xmis = X; % matrix of imputed values
Xerr = zeros(size(X)); % standard error imputed vals.
% alternative: sparse matrices to save some memory
%[jmis,kmis] = ind2sub([n, p], indmis);
%Xmis = sparse(jmis, kmis, NaN, n, p); % matrix of imputed values
%Xerr = sparse(jmis, kmis, Inf, n, p); % standard error imputed vals.
% find unique missingness patterns
[np, kavlr, kmisr, prows, mp, iptrn] = missingness_patterns(X);
disp(sprintf('\nREGEM:'))
disp(sprintf('\tSample size: %4i', n))
disp(sprintf('\tUnique missingness patterns: %4i', np))
disp(sprintf('\tPercentage of values missing: %5.2f', nmis/(n*p)*100))
disp(sprintf('\tStagnation tolerance: %9.2e', stagtol))
disp(sprintf('\tMaximum number of iterations: %3i', maxit))
if (inflation ~= 1)
disp(sprintf('\tResidual (co-)variance inflation: %6.3f ', inflation))
end
if Xmis0_given && C0_given
disp(sprintf(['\tInitialization with given imputed values and' ...
' covariance matrix.']))
elseif C0_given
disp(sprintf(['\tInitialization with given covariance' ...
' matrix.']))
elseif Xmis0_given
disp(sprintf('\tInitialization with given imputed values.'))
else
disp(sprintf('\tInitialization of missing values by mean substitution.'))
end
switch regress
case 'mridge'
disp(sprintf('\tOne multiple ridge regression per record:'))
disp(sprintf('\t==> one regularization parameter per record.'))
case 'iridge'
disp(sprintf('\tOne individual ridge regression per missing value:'))
disp(sprintf('\t==> one regularization parameter per missing value.'))
case 'ttls'
disp(sprintf('\tOne total least squares regression per record.'))
end
if D_given
disp(sprintf('\tUsing scaling for variables given by OPTIONS.scalefac.'))
else
disp(sprintf('\tPerforming regularization on correlation matrices.'))
end
switch regress
case {'mridge', 'iridge'}
if regpar_given
disp(sprintf('\tFixed regularization parameter: %9.2e', optreg.regpar))
end
case 'ttls'
if regpar_given
disp(sprintf('\tFixed truncation parameter: %5i', trunc))
else
if strmatch(truncslct, 'kcv', 'exact')
disp(sprintf(['\tTruncation choice: K-fold cross validation (K = ', int2str(Kcv), ')']))
if ncv < n
disp(sprintf(['\tOnly records 1:', int2str(ncv) ' left out in cross-validation.']))
end
disp(sprintf(['\tNorm of CV error to be minimized: ', int2str(cv_norm)]))
else
disp(sprintf(['\tTruncation choice criterion: ', upper(truncslct)]))
end
if trial_trunc_given
disp(sprintf(['\tTruncation parameters restricted between ', ...
int2str(min(trial_trunc)), ' (min) and ', ...
int2str(max(trial_trunc)), ' (max).']))
end
end
end
if Xcmp_given
disp(sprintf(['\n\tIter \tmean(peff) \t|X-Xcmp|/std(Xcmp) ' ...
'\t|D(Xmis)|/|Xmis|']))
else
disp(sprintf(['\n\tIter \tmean(peff) \t|D(Xmis)| ' ...
'\t|D(Xmis)|/|Xmis|']))
end
% initial estimates of missing values
if Xmis0_given
% substitute given guesses for missing values
X(indmis) = Xmis0(indmis);
[X, M] = center(X); % center data to mean zero
else
[X, M] = center(X); % center data to mean zero
X(indmis) = zeros(nmis, 1); % fill missing entries with zeros
end
if C0_given
C = C0;
else
C = X'*X / dofC; % initial estimate of covariance matrix
end
% initialize matrices needed for each missingness pattern
B = cell(np, 1);
S = cell(np, 1);
for j=1:np
S{j} = zeros(length(kmisr{j}), length(kmisr{j}));
end
peff = cell(np, 1);
% get cross-validation partitionings for K-fold CV if needed.
% (Here we fix these partitionings to be invariant from iteration to
% iteration, which should help improve convergence. It is also
% conceivable to use different random partitionings in each iteration.)
if strmatch(truncslct, 'kcv', 'exact')
[incv, outcv, nin] = kcvindices(ncv, Kcv, n-ncv);
end
it = 0;
rdXmis = Inf;
while (it < maxit && rdXmis > stagtol)
it = it + 1;
% initialize for this iteration ...
CovRes = zeros(p, p); % ... residual covariance matrix
peff_ave = 0; % ... average effective number of variables
% scale variables
if D_given
% Strictly speaking, this would not need to be repeated at
% each iteration when D is given, but easier to code that
% way.
[X, C] = rescale(X, C, D);
else
[X, C, D] = rescale(X, C); % scale to unit variance
end
if strmatch(regress, 'ttls')
% compute eigendecomposition of covariance/correlation matrix
[V, d] = peigs(C, neigs);
% compute necessary global quantities for truncation selection criteria
if strmatch(truncslct, 'kcv', 'exact')
Xcv = cell(Kcv, 1);
Mcv = cell(Kcv, 1);
Ccv = cell(Kcv, 1);
Dcv = cell(Kcv, 1);
Vcv = cell(Kcv, 1);
dcv = cell(Kcv, 1);
% compute partitionings needed for K-fold CV (this can and should
% be parallelized to speed it up)
for k=1:Kcv
% get CV sample k
Xcv{k} = X(incv{k}, :);
% re-center CV sample
[Xcv{k}, Mcv{k}] = center(Xcv{k});
% compute covariance matrix estimate for CV sample, consisting
% of the sample covariance matrix of the completed CV sample plus
% contributions from residual covariance matrices
% first, collect residual covariance contribution
CovResCv = zeros(p, p);
for i=incv{k}
j = iptrn(i);
CovResCv(kmisr{j}, kmisr{j}) = CovResCv(kmisr{j}, kmisr{j}) ...
+ S{j};
end
% rescale residual covariance matrix contribution
CovResCv = CovResCv ./ repmat(D', p, 1) ./ repmat(D, 1, p);
% assemble covariance matrix estimate for CV sample
Ccv{k} = (Xcv{k}'*Xcv{k} + CovResCv) ./ (nin{k}-1);
if ~D_given
% re-standardize data within CV sample
[Xcv{k}, Ccv{k}, Dcv{k}] = rescale(Xcv{k}, Ccv{k});
else
Dcv{k} = ones(p, 1);
end
% compute eigendecomposition for CV sample
neigscv = min([neigs, nin{k}-1, p]);
[Vcv{k}, dcv{k}] = peigs(Ccv{k}, neigscv);
end
else
% compute global truncation selection criteria
if ~trial_trunc_given
trial_trunc = 0 : length(d)-1;
end
[mdl, ne08, aic, aicc] = pca_truncation_criteria(d, p, trial_trunc, n);
end
end
for j=1:np % cycle over missingness patterns
pm = length(kmisr{j}); % number of missing values in this pattern
if pm > 0
pa = p - pm; % number of available values in this pattern
if pa < 1
error('No available values in at least one row of the data matrix.')
end
% regression of missing variables on available variables
switch regress
case 'mridge'
% one multiple ridge regression per pattern
[B{j}, S{j}, ~, peff{j}] = mridge(C(kavlr{j},kavlr{j}), ...
C(kmisr{j},kmisr{j}), ...
C(kavlr{j},kmisr{j}), n-1, optreg);
peff{j} = peff{j} .* ones(pm, 1);
case 'iridge'
% one individual ridge regression per missing value
[B{j}, S{j}, ~, peff{j}] = iridge(C(kavlr{j},kavlr{j}), ...
C(kmisr{j},kmisr{j}), ...
C(kavlr{j},kmisr{j}), n-1, optreg);
case 'ttls'
% truncated total least squares
if ~regpar_given
if strmatch(truncslct, 'kcv', 'exact')
[trunc, x_rmserr] = kcv_ttls(Vcv, dcv, Dcv, Mcv, kavlr{j}, kmisr{j}, ...
outcv, X, kavlr, iptrn, trial_trunc, cv_norm);
else
imax = find(trial_trunc <= min(pa, length(d)-1), 1, 'last' );
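% pick the truncation that minimizes the selected criterion (mdl, ne08, aic, or aicc)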
[~, imin] = eval(['min(', truncslct, '(1:imax))']);
trunc = trial_trunc(imin);
end
end
[B{j}, S{j}] = pttls(V, d, kavlr{j}, kmisr{j}, trunc);
peff{j} = trunc .* ones(pm, 1);
end
% missing value estimates
Xmis(prows{j}, kmisr{j}) = X(prows{j}, kavlr{j}) * B{j};
% inflation of residual covariance matrix
S{j} = inflation * S{j};
% restore original scaling to residual covariance matrix
S{j} = S{j} .* repmat(D(kmisr{j})', pm, 1) .* repmat(D(kmisr{j}), 1, pm);
% add up contributions to residual covariance matrix estimate in
% original scaling
CovRes(kmisr{j}, kmisr{j}) = CovRes(kmisr{j}, kmisr{j}) + mp(j)*S{j};
peff_ave = peff_ave + mp(j)*sum(peff{j})/nmis; % add up eff. number of variables
dofS = dofC - peff{j};
if strmatch(truncslct, 'kcv', 'exact')
% use K-fold CV estimate of standard error in imputed values
Xerr(prows{j}, kmisr{j}) = repmat(x_rmserr .* D(kmisr{j})', mp(j), 1);
else
% GCV estimate of standard error in imputed values
Xerr(prows{j}, kmisr{j}) = repmat(( dofC * sqrt(diag(S{j})) ./ dofS)', mp(j), 1);
end
end
end % loop over patterns
% rescale variables to original scaling
X = X .* repmat(D', n, 1);
Xmis = Xmis .* repmat(D', n, 1);
% rms change of missing values
dXmis = norm(Xmis(indmis) - X(indmis)) / sqrt(nmis);
% relative change of missing values
nXmis_pre = norm(X(indmis) + M(kmis)') / sqrt(nmis);
if nXmis_pre < eps
rdXmis = Inf;
else
rdXmis = dXmis / nXmis_pre;
end
% update data matrix X
X(indmis) = Xmis(indmis);
% re-center data and update mean
[X, Mup] = center(X); % re-center data
M = M + Mup; % updated mean vector
% update covariance matrix estimate
C = (X'*X + CovRes)/dofC;
if Xcmp_given
% imputed values in original scaling
Xmis(indmis) = X(indmis) + M(kmis)';
% relative error of imputed values (relative to values in Xcmp)
dXmis = norm( (Xmis(indmis)-Xcmp(indmis))./sXcmp(kmis)' ) ...
/ sqrt(nmis);
disp(sprintf(' \t%3i \t %8.2e \t %10.3e \t %10.3e', ...
it, peff_ave, dXmis, rdXmis))
else
disp(sprintf(' \t%3i \t %8.2e \t%9.3e \t %10.3e', ...
it, peff_ave, dXmis, rdXmis))
end
end % EM iteration
% add mean back to centered data matrix
X = X + repmat(M, n, 1);
% rescale matrix of regression coefficients so that the matrix output is
% that for the regression of missing values on available values in the
% original scaling of variables
for j=1:np
pm = length(kmisr{j}); % number of missing values in this pattern
if pm > 0
B{j} = diag(1./D(kavlr{j})) * B{j} * diag(D(kmisr{j}));
else
B{j} = NaN;
end
end
% Note that the B matrices are the regression coefficients as of the last
% iteration of the algorithm, *before the final update of the mean and
% covariance matrices.* That is, they are the regression coefficients for
% the data centered with the penultimate estimate of the mean vector, not
% with the final estimate that is returned. With centered data as of the
% beginning of the last iteration,
%
% Xc = X - repmat(M-Mup, n, 1);
%
% the regression coefficients give the imputed values in the original
% scaling
%
% Xc(prows{j}, kmisr{j}) = Xc(prows{j}, kavlr{j}) * B{j}