Imputation (`mice`)

The main function of the package is mice, which takes a Tables.jl-compatible table as its input. It returns a multiply imputed dataset (Mids) object with the imputed values.

Mice.Mids — Type

Mids

A multiply imputed dataset object.

The data originally supplied are stored as data.

The imputed data are stored as imputations (one column per imputation).

The locations at which data have been imputed are stored as imputeWhere.

The number of imputations is stored as m.

The imputation method for each variable is stored as methods.

The predictor matrix is stored as predictorMatrix.

The order in which the variables are imputed is stored as visitSequence.

The number of iterations is stored as iter.

The mean of each variable across the imputations is stored as meanTraces.

The variance of each variable across the imputations is stored as varTraces.


Mice.mice — Function

mice(
    data;
    m::Int = 5,
    imputeWhere::Union{NamedVector{Vector{Bool}}, Nothing} = nothing,
    visitSequence::Union{Vector{String}, Nothing} = nothing,
    methods::Union{NamedVector{String}, Nothing} = nothing,
    predictorMatrix::Union{NamedMatrix{Bool}, Nothing} = nothing,
    iter::Int = 10,
    progressReports::Bool = true,
    gcSchedule::Float64 = 1.0,
    threads::Bool = false,
    kwargs...
    )

Imputes missing values in a dataset using the MICE algorithm. The output is a Mids object.

The data containing missing values (data) must be supplied as a Tables.jl table.

The number of imputations created is specified by m.

imputeWhere is a NamedVector of boolean vectors specifying where data are to be imputed. The default is to impute all missing data.

The variables will be imputed in the order specified by visitSequence. The default is sorted by proportion of missing data in ascending order; the order can be customised using a vector of variable names in the desired order. Any column not to be imputed at all can be left out of the visit sequence.

The imputation method for each variable is specified by the NamedVector methods. The default is to use predictive mean matching (pmm) for all variables. Currently only pmm is supported. Any variable not to be imputed can be marked as such using an empty string ("").

The predictor matrix is specified by the NamedMatrix predictorMatrix. The default is to use all other variables as predictors for each variable. Any variable not predicting another variable can be marked as such in the matrix using false (0).

The number of iterations is specified by iter.

If progressReports is true, a progress indicator will be displayed in the console.

gcSchedule controls when the garbage collector (GC) is additionally invoked. The value is a threshold: the GC is called when the fraction of your RAM remaining falls to this value. For small datasets, a value of 0.0 (never called) may be enough, but for larger datasets it may be worthwhile to call the GC more frequently. The default (1.0) calls it after each iteration of each variable, which may reduce performance if your dataset does not need it.

threads dictates whether multi-threading will be used. This will improve performance for larger jobs if and only if Julia has been launched with multiple threads (which you can verify by calling Threads.nthreads()). The default is false.


mice(
    mids::Mids;
    iter::Int = 10,
    progressReports::Bool = true,
    gcSchedule::Float64 = 1.0,
    threads::Bool = false,
    kwargs...
    )

Adds additional iterations to an existing Mids object.

The number of additional iterations is specified by iter.

progressReports, gcSchedule and threads can also be specified: all other arguments will be ignored.

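The gcSchedule threshold described above can be pictured with a small sketch. This is only an illustration of the documented behaviour, not the code that Mice.jl runs: the helper shouldCollect is hypothetical, and the comparison it uses (free fraction at or below the threshold) is an assumption chosen so that 1.0 means "always" and 0.0 means "practically never". Sys.free_memory, Sys.total_memory and GC.gc are standard Julia functions.

```julia
# Hypothetical helper (not part of Mice.jl): should the garbage collector
# run, given the fraction of RAM that is currently free?
shouldCollect(freeFraction::Float64, gcSchedule::Float64) = freeFraction <= gcSchedule

# Fraction of system memory currently free, from the Julia standard library.
freeFraction = Sys.free_memory() / Sys.total_memory()

# Compare a few candidate thresholds against the current state of the machine.
for schedule in (0.0, 0.25, 1.0)
    println("gcSchedule = ", schedule, " => collect now? ",
            shouldCollect(freeFraction, schedule))
end

# Run a collection manually if less than a quarter of RAM is free.
shouldCollect(freeFraction, 0.25) && GC.gc()
```

Under this reading, a threshold between 0.0 and 1.0 trades a little run time for a lower peak memory footprint on large datasets.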

Customising the imputation setup

You can customise various aspects of the imputation setup by passing keyword arguments to mice. These are described above. You can also use some of the functions below to define objects that you can customise to alter how mice handles the imputation.

Locations to impute

You can customise which data points are imputed by manipulating the imputeWhere argument. By default, this will specify that all missing data are to be imputed (using the function findMissings()).

Mice.findMissings — Function

findMissings(data)

Returns a named vector of boolean vectors describing the locations of missing data in each column of the provided data table.

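To make the shape of this object concrete, the sketch below builds a comparable structure in plain Julia. It is not the implementation of findMissings: it uses a Dict (named missingMask here, a hypothetical name) instead of the NamedVector that Mice.jl returns, and a NamedTuple of vectors as a minimal Tables.jl-compatible column table.

```julia
# A NamedTuple of vectors is a simple Tables.jl-compatible column table.
myColumns = (
    col1 = [1.0, missing, 3.0, missing, 5.0],
    col2 = [1, 2, missing, 4, 5],
    col3 = [missing, "2", missing, "4", missing]
)

# Hypothetical stand-in for findMissings: one Boolean vector per column,
# true wherever the value is missing.
missingMask = Dict(
    String(name) => collect(ismissing.(col)) for (name, col) in pairs(myColumns)
)

# col1 is missing at positions 2 and 4 only.
missingMask["col1"]

# Marking every position as true would request over-imputation of col1.
missingMask["col1"] .= true
```

The same idea carries over to the real NamedVector: each entry is an ordinary Vector{Bool} that can be edited in place.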

You can over-impute existing data by setting the locations of non-missing data to true in the relevant vector in imputeWhere. For example, to impute all data points in the variable col1 (even those that are not missing), you could do the following:

using DataFrames, Mice, Random

myData = DataFrame(
    :col1 => Vector{Union{Missing, Float64}}([1.0, missing, 3.0, missing, 5.0]),
    :col2 => Vector{Union{Missing, Int64}}([1, 2, missing, 4, 5]),
    :col3 => Vector{Union{Missing, String}}([missing, "2", missing, "4", missing])
);

myImputeWhere = findMissings(myData)
# 3-element Named Vector{Vector{Bool}}
# A    |
# -----|--------------------
# col1 | Bool[0, 1, 0, 1, 0]
# col2 | Bool[0, 0, 1, 0, 0]
# col3 | Bool[1, 0, 1, 0, 1]

myImputeWhere["col1"][:] .= true;
myImputeWhere
# 3-element Named Vector{Vector{Bool}}
# A    |
# -----|--------------------
# col1 | Bool[1, 1, 1, 1, 1]
# col2 | Bool[0, 0, 1, 0, 0]
# col3 | Bool[1, 0, 1, 0, 1]

# Not run
mice(myData, imputeWhere = myImputeWhere)

Visit sequence

The visit sequence is the order in which the variables are imputed. By default, mice sorts the variables in order of missingness (lowest to highest) via the internal function makeMonotoneSequence. You can instead define your own visit sequence by creating a vector of variable names in your desired order and passing that to mice. For example:

using DataFrames, Mice, Random

myData = DataFrame(
    :col1 => Vector{Union{Missing, Float64}}([1.0, missing, 3.0, missing, 5.0]),
    :col2 => Vector{Union{Missing, Int64}}([1, 2, missing, 4, 5]),
    :col3 => Vector{Union{Missing, String}}([missing, "2", missing, "4", missing])
);

Mice.makeMonotoneSequence(myData)
# 3-element Vector{String}:
# "col2"
# "col1"
# "col3"

myVisitSequence1 = names(myData)
# 3-element Vector{String}:
# "col1"
# "col2"
# "col3"

Random.seed!(1234); # Set random seed for reproducibility

# Not run
mice(myData, visitSequence = myVisitSequence1)

myVisitSequence2 = ["col3", "col1", "col2"]
# 3-element Vector{String}:
# "col3"
# "col1"
# "col2"

# Not run
mice(myData, visitSequence = myVisitSequence2)

Assuming that the imputations converge normally, changing the visit sequence should not dramatically affect the output. However, it can be useful to change the visit sequence if you want to impute variables in a particular order for a specific reason. The sequence used by default in Mice.jl can make convergence faster in cases where the data follow a (near-)"monotone" missing data pattern [2].
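The default ordering (lowest to highest proportion of missing data) can be reproduced by hand. The sketch below is an illustration in plain Julia under the assumption that ties keep their original column order; it is not the source of the internal function makeMonotoneSequence, and the names proportionMissing and sketchSequence are hypothetical.

```julia
# Same toy data as above, as a NamedTuple of vectors.
myColumns = (
    col1 = [1.0, missing, 3.0, missing, 5.0],
    col2 = [1, 2, missing, 4, 5],
    col3 = [missing, "2", missing, "4", missing]
)

columnNames = [String(name) for name in keys(myColumns)]

# Proportion of missing values in each column: 0.4, 0.2 and 0.6 here.
proportionMissing = [count(ismissing, col) / length(col) for col in values(myColumns)]

# Order the column names by ascending proportion of missing data.
# sortperm is stable, so tied columns keep their original order.
sketchSequence = columnNames[sortperm(proportionMissing)]
```

For this dataset the result is col2, col1, col3, which agrees with the output of Mice.makeMonotoneSequence shown above.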

You can leave variables out of visitSequence to prevent mice() from imputing them.

Predictor matrix

The predictor matrix defines which variables in the imputation model are used to predict which others. By default, every variable predicts every other variable, but there is a wide range of cases in which this is not desirable. For example, if your dataset includes an ID column, this is clearly useless for imputation and should be ignored.

To create a default predictor matrix that you can edit, you can use the function makePredictorMatrix.

Mice.makePredictorMatrix — Function

makePredictorMatrix(data)

Returns a named matrix of booleans defining the predictors for each variable in data. The variables to be predicted are on the rows, and the predictors are on the columns. The default is to use all variables as predictors for all other variables (i.e. all 1s except for the diagonal, which is 0).


You can then edit the predictor matrix to remove any predictive relationships that you do not want to include in the imputation model. For example:

using DataFrames, Mice, Random

myData = DataFrame(
    :id => Vector{Int64}(1:5),
    :col1 => Vector{Union{Missing, Float64}}([1.0, missing, 3.0, missing, 5.0]),
    :col2 => Vector{Union{Missing, Int64}}([1, 2, missing, 4, 5]),
    :col3 => Vector{Union{Missing, String}}([missing, "2", missing, "4", missing])
);

myPredictorMatrix = makePredictorMatrix(myData)
# 4x4 Named Matrix{Bool}
# A \ B |    id   col1   col2   col3
# ------|---------------------------
# id    | false   true   true   true
# col1  |  true  false   true   true
# col2  |  true   true  false   true
# col3  |  true   true   true  false

# To stop the ID column from predicting any other variable
myPredictorMatrix[:, "id"] .= false;
myPredictorMatrix
# 4x4 Named Matrix{Bool}
# A \ B |    id   col1   col2   col3
# ------|---------------------------
# id    | false   true   true   true
# col1  | false  false   true   true
# col2  | false   true  false   true
# col3  | false   true   true  false

# To stop col1 from predicting col3
myPredictorMatrix["col3", "col1"] = false;
myPredictorMatrix
# 4x4 Named Matrix{Bool}
# A \ B |    id   col1   col2   col3
# ------|---------------------------
# id    | false   true   true   true
# col1  | false  false   true   true
# col2  | false   true  false   true
# col3  | false  false   true  false

Random.seed!(1234); # Set random seed for reproducibility

# Not run
mice(myData, predictorMatrix = myPredictorMatrix)
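The row and column convention (predicted variables on the rows, predictors on the columns) is easy to get backwards, so here is a minimal sketch that mirrors the edits above using only a plain Matrix{Bool}. The Dict lookup named index is a hypothetical stand-in for the name-based indexing that the named matrix provides; none of this is Mice.jl code.

```julia
variableNames = ["id", "col1", "col2", "col3"]
n = length(variableNames)

# Default: every variable predicts every other variable, but never itself.
sketchMatrix = [i != j for i in 1:n, j in 1:n]

# Hypothetical name-to-position lookup.
index = Dict(name => i for (i, name) in enumerate(variableNames))

# Rows are the variables being predicted; columns are the predictors.
sketchMatrix[:, index["id"]] .= false                # id predicts nothing
sketchMatrix[index["col3"], index["col1"]] = false   # col1 does not predict col3

# Reading along a row lists the predictors that remain for that variable.
predictorsOfCol3 = variableNames[sketchMatrix[index["col3"], :]]
```

Reading along the col3 row leaves col2 as its only predictor, matching the last matrix printed above.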

Methods

The imputation methods are the functions that are used to impute each variable. By default, mice uses predictive mean matching ("pmm") for all variables (and currently PMM is the only method that Mice.jl supports).

To create a default methods vector, use the function makeMethods.

Mice.makeMethods — Function

makeMethods(data)

Returns a named vector of strings defining the method by which each variable in data should be imputed in the mice() function. The default (and only supported) method is predictive mean matching (pmm).


You can then customise the vector as needed. For example:

using DataFrames, Mice, Random

myData = DataFrame(
    :id => Vector{Int64}(1:5),
    :col1 => Vector{Union{Missing, Float64}}([1.0, missing, 3.0, missing, 5.0]),
    :col2 => Vector{Union{Missing, Int64}}([1, 2, missing, 4, 5]),
    :col3 => Vector{Union{Missing, String}}([missing, "2", missing, "4", missing])
);

myMethods = makeMethods(myData)
# 4-element Named Vector{String}
# A    |
# -----|------
# id   | "pmm"
# col1 | "pmm"
# col2 | "pmm"
# col3 | "pmm"

# To stop the ID column from being imputed (but you can also achieve this by leaving "id"
# out of the visit sequence)
myMethods["id"] = "";
myMethods
# 4-element Named Vector{String}
# A    |
# -----|------
# id   |    ""
# col1 | "pmm"
# col2 | "pmm"
# col3 | "pmm"

Random.seed!(1234); # Set random seed for reproducibility

# Not run
mice(myData, methods = myMethods)

Diagnostics

After performing multiple imputation, you should inspect the trace plots of the imputed variables to verify convergence. Mice.jl includes a plotting function to do this.

RecipesBase.plot — Function

plot(
    mids::Mids,
    var::String
    )

Plots the mean and standard deviation of the imputed values for a given variable. Here var is given as a string (the name of the variable).


plot(
    mids::Mids,
    var_no::Int
    )

Plots the mean and standard deviation of the imputed values for a given variable. Here var_no is given as an integer (the index of the variable in the visitSequence).


You need to load the package Plots.jl to see the plots:

using Plots

# Not run
plot(myMids, 7)

Binding imputations together

If you have a number of Mids objects that were produced in the same way (e.g. through multithreading), you can bind them together into a single Mids object using the function bindImputations. Note that the log of events might not make sense in the resulting object: it is better to inspect the logs of the individual objects before binding them together.

Mice.bindImputations — Function

bindImputations(
    mids1::Mids,
    mids2::Mids
    )

Combines two Mids objects into one. The two objects must have been created from the same dataset, with the same imputation methods, predictor matrix, visit sequence and number of iterations. The number of imputations can be different.


bindImputations(
    midsVector::Vector{Mids}
    )

Combines a vector of Mids objects into one Mids object. They must all have been created from the same dataset with the same imputation methods, predictor matrix, visit sequence and number of iterations. The number of imputations can be different.


bindImputations(
    mids...
    )

Combines any number of Mids objects into one Mids object. They must all have been created from the same dataset with the same imputation methods, predictor matrix, visit sequence and number of iterations. The number of imputations can be different.

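Because imputations are stored with one column per imputation, binding can be thought of as placing those columns side by side, so the combined object holds the sum of the individual numbers of imputations. The sketch below illustrates only that bookkeeping with made-up numbers; it does not show how bindImputations is implemented, and the matrices imputationsA and imputationsB are hypothetical.

```julia
# Hypothetical imputed values for one variable with two imputed locations:
# rows are imputed locations, columns are imputations.
imputationsA = [2.0 2.5; 4.0 3.5]            # from a run with m = 2
imputationsB = [1.5 2.0 3.0; 4.5 4.0 3.0]    # from a run with m = 3

# Binding places the imputations side by side.
combined = hcat(imputationsA, imputationsB)

# Number of imputations after binding: 2 + 3.
mCombined = size(combined, 2)
```

This is why the runs must share the same dataset and setup: the rows of each block need to refer to the same imputed locations.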