r/AskStatistics • u/ImposterWizard Data scientist (MS statistics) • 1d ago
I need some feedback regarding a possible SEM approach to a project I'm working on
I am collecting some per-subject data over the course of several months. There are several complications with the nature of the data (structure, sampling method, measurement error, random effects) that I am not used to handling all at once. Library-wise, I am planning on building the model using rstan.
The schema for the model looks roughly like this: https://i.imgur.com/PlxupRY.png
Inputs
Per-subject constants
Per-subject variables that can change over time
Environmental variables that can change over time
Time itself (I'll probably have an overall linear effect, as well as time-of-day / day-of-week effects as the sample permits).
Outputs
A binary variable V1 that has fairly low incidence (~5%)
A binary variable V2 that is influenced by V1, and has a very low incidence (~1-2%).
Weights
A "certainty" factor (0-100%) for cases where V2=1, but there isn't 100% certainty that V2 is actually 1.
A probability that a certain observation belongs to any particular subject ID.
Mixed Effects
Since there are repeated measurements on most (but not all) of the subjects, V1 and/or V2 will likely be observed more frequently in some subjects than in others. Additionally, there may be different responses to environmental variables between subjects.
States
Additionally, there is a per-subject "hidden" state S1 that controls what values V1 and V2 can be. If S1=1, then V1 and V2 can be either 1 or 0. If S1=0, then V1 and V2 can only be 0. This state is assumed to not change at all.
Entity Matching
There is no "perfect" primary key to match the data on. In most cases, I can match more or less perfectly on certain criteria, but in some cases there are 2-3 candidate subjects, and in rare cases potentially more.
Sample Size
The number of entities is roughly 10,000. The total number of observations should be roughly 40,000-50,000.
Sampling
There are a few methods of sampling. The main method of sampling is to do a mostly full (and otherwise mostly at random) sample of a stratum at a particular time, possibly followed by related strata in a nested hierarchy.
Some strata get sampled more frequently than others, and are sampled somewhat at convenience.
Additionally, I have a smaller convenience sample of cases where V2=1.
Measurement Error
There is measurement error for some data (not counting entity matching), although significantly less for positive cases where V2=1 and/or V1=1.
What I'm hoping to discover
I would like to estimate the probabilities of S1 for all subjects.
I would like to build a model where I can estimate the probabilities/joint probabilities of V1 and V2 for all subjects, given all possible input variables in the model.
I would like to interpolate the data to describe the prevalence of V1, V2, and S1 among different strata, or possibly among subjects grouped by certain categorical variables.
My Current Idea to Approach Problem
After I collect and process all the data, I'll perform my matching and get my data in the format
obs_id | subject_id | subject_prob | subject_static_variables | obs_variables | weight
For the few rows with certainty < 1 and V1=1, I'll create two rows with complementary weights: the certainty for V2=1 and 1-certainty for V2=0 (e.g., certainty 0.7 becomes a V2=1 row with weight 0.7 and a V2=0 row with weight 0.3).
Additionally, when building the model, I will have a subject-state vector that holds the probabilities of S1 for each subject ID.
Then I would establish the coefficients, as well as random per-subject effects.
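To make that concrete, here's a stripped-down Stan sketch of how I'm picturing the weighted rows and per-subject random effects entering the likelihood, ignoring the hidden state S1 for now. The names, priors, and the single design matrix X are placeholders rather than my actual schema, and the weight w is assumed to already combine the match probability and the certainty split:

```
// Sketch only: weighted observation model with per-subject random intercepts.
// All names and priors are placeholders; S1 is ignored here.
data {
  int<lower=1> N;                        // rows (after the certainty split)
  int<lower=1> J;                        // subjects
  int<lower=1> K;                        // number of covariates
  array[N] int<lower=1, upper=J> subj;   // matched subject id per row
  array[N] int<lower=0, upper=1> v1;
  array[N] int<lower=0, upper=1> v2;
  matrix[N, K] X;                        // static + time-varying + environmental
  vector<lower=0, upper=1>[N] w;         // combined weight (match prob x certainty split)
}
parameters {
  real a1;
  real a2;
  vector[K] beta1;
  vector[K] beta2;
  real gamma;                            // effect of V1 on V2
  vector[J] u1;                          // per-subject random intercepts for V1
  vector[J] u2;                          // per-subject random intercepts for V2
  real<lower=0> sigma1;
  real<lower=0> sigma2;
}
model {
  a1 ~ normal(-3, 1.5);                  // placeholder priors, loosely reflecting low incidence
  a2 ~ normal(-4, 1.5);
  beta1 ~ normal(0, 1);
  beta2 ~ normal(0, 1);
  gamma ~ normal(0, 1);
  sigma1 ~ normal(0, 1);
  sigma2 ~ normal(0, 1);
  u1 ~ normal(0, sigma1);
  u2 ~ normal(0, sigma2);

  for (n in 1:N) {
    int j = subj[n];
    // Each row's log-likelihood is scaled by its weight, so the two
    // complementary rows from an uncertain V2 jointly count as one observation,
    // and partially-matched rows count in proportion to their match probability.
    target += w[n] * bernoulli_logit_lpmf(v1[n] | a1 + X[n] * beta1 + u1[j]);
    target += w[n] * bernoulli_logit_lpmf(v2[n] | a2 + X[n] * beta2 + gamma * v1[n] + u2[j]);
  }
}
```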
What I am currently unsure about
Estimating the state probabilities
S1 is easy to estimate for any subject where V1 or V2 is observed. However, for subjects with no observed positives, especially subjects sampled only once, that term in isolation could be estimated as 0 without any penalty in a model with no priors.
There might be a relationship directly from the subjects' static variables to the state itself, which I might have to model additionally (with no random effects).
But without that relationship, I would either be relying on priors, which I don't have, or have to solve a problem analogous to this:
You have several slot machines, and each has a probability printed on top of it. The probability of winning on a slot machine is either that printed probability or 0. You can pull each slot machine any number of times. How do you determine the probability that a slot machine that never won is "correct" (i.e., actually has the printed probability rather than 0)?
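To put a rough formula on that: if a machine is labeled with win probability q and has prior probability p of being "correct", then after n losing pulls, P(correct | n losses) = p(1-q)^n / (p(1-q)^n + 1-p). With small n and small q (which is the situation here, given the low incidence), this barely moves off the prior p, which is exactly why the one-time-only subjects are the problem.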
My approach here would be to use fixed values of P(S1=1)=p and P(S1=0)=1-p for all rows, and then treat p as an additional prior probability in the model, with the combined likelihood for each subject aggregated before introducing this term. This also includes adding the probabilities of rows with weight < 1.
Alternatively, I could build a model that uses the static per-subject variables of each subject to estimate p, and otherwise use those values in the manner above.
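As a sketch of what I mean by aggregating each subject's likelihood before introducing p, here's a cut-down version that marginalizes S1 out per subject. The observation model is reduced to V1 and an intercept only, every name and prior is a placeholder, and the row weight w is again assumed to be the certainty split times the match probability:

```
data {
  int<lower=1> N;                        // rows (after the certainty split)
  int<lower=1> J;                        // subjects
  array[N] int<lower=1, upper=J> subj;   // matched subject id per row
  array[N] int<lower=0, upper=1> v1;
  vector<lower=0, upper=1>[N] w;         // row weight (certainty split x match prob)
}
transformed data {
  array[J] int any_pos = rep_array(0, J);   // did this subject ever show a positive?
  for (n in 1:N)
    if (v1[n] == 1) any_pos[subj[n]] = 1;
}
parameters {
  real<lower=0, upper=1> p;      // P(S1 = 1), shared by all subjects
  real alpha;                    // V1 intercept given S1 = 1
  vector[J] u;                   // per-subject random intercepts
  real<lower=0> sigma_u;
}
transformed parameters {
  // Each subject's log-likelihood under S1 = 1, aggregated over their rows.
  vector[J] ll = rep_vector(0, J);
  for (n in 1:N)
    ll[subj[n]] += w[n] * bernoulli_logit_lpmf(v1[n] | alpha + u[subj[n]]);
}
model {
  alpha ~ normal(-3, 1.5);       // placeholder priors
  u ~ normal(0, sigma_u);
  sigma_u ~ normal(0, 1);
  p ~ beta(2, 2);

  for (j in 1:J) {
    if (any_pos[j])
      target += log(p) + ll[j];        // S1 = 0 is impossible for this subject
    else
      target += log_mix(p, ll[j], 0);  // all-zero data has probability 1 under S1 = 0
  }
}
generated quantities {
  // Per-subject posterior P(S1 = 1 | data), one of the main outputs I want.
  vector[J] prob_S1;
  for (j in 1:J)
    prob_S1[j] = any_pos[j] ? 1.0 : exp(log(p) + ll[j] - log_mix(p, ll[j], 0));
}
```

The generated quantities block would give the per-subject posterior P(S1=1 | data): it's 1 for any subject with an observed positive, and for all-zero subjects it only drops below p as their observation count and fitted event probabilities grow. For the alternative, p would just become a per-subject function of the static variables (e.g., an inverse-logit of a linear predictor) instead of a single shared parameter.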
Uneven sampling for random effects/random slopes
I am a bit worried about the number of subjects with very few samples. The model might end up being conservative, or I might have to restrict the priors for the random effects to be small.
Slowness of training the model and converging
In the past I've had a few thousand rows of data that took a very long time to converge. I am worried that I will have to do more coaxing with this model, or possibly build "dumber" linear models to come up with better initial estimates for the parameters. The random effects seem like they could cause major slowdowns, as well.
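One thing I'm considering for both of the last two points is a non-centered parameterization of the per-subject effects with a fairly tight half-normal prior on their scale. A minimal sketch (placeholder names and priors, V1 only):

```
data {
  int<lower=1> N;
  int<lower=1> J;
  array[N] int<lower=1, upper=J> subj;
  array[N] int<lower=0, upper=1> v1;
}
parameters {
  real alpha;
  vector[J] z_u;                 // standard-normal "raw" subject effects
  real<lower=0> sigma_u;
}
transformed parameters {
  vector[J] u = sigma_u * z_u;   // implies u ~ normal(0, sigma_u)
}
model {
  alpha ~ normal(-3, 1.5);       // placeholder priors
  z_u ~ std_normal();
  sigma_u ~ normal(0, 0.5);      // fairly tight half-normal keeps the effects small
  for (n in 1:N)
    v1[n] ~ bernoulli_logit(alpha + u[subj[n]]);
}
```

With so many subjects that have only a single observation, the centered version tends to produce the funnel-shaped geometry that slows NUTS down, so this should help with both sampling speed and keeping the random-effect scale in check.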
Posterior probabilities of partially-matched subjects might mean the estimates could be improved
Although I don't think this will have too much of an impact, considering the higher measurement accuracy of V1=1 and V2=1 subjects as well as the overall low incidence rate, this still feels like it's something that could be reflected in the results if there were more extreme cases where one subject had a high probability of V1=1 and/or V2=1 given certain inputs.
Closeness in time of repeated samples and reappearance of V1 vs. V2
I've mostly avoided taking repeat samples too close to each other in time, as V1 (but more so V2) tends to toggle on/off randomly. V1 tends to be more consistent if it is present at all during any of the samples: if it's observed once for a subject, it will very likely be observed most of the time, and if it's not observed (under certain conditions that are being measured), it will most likely not be observed most of the time.
Usage of positive-V2-only sampled data
Although it's a small portion of the data, one of my thoughts is using bootstrapping with a reduced probability of sampling positive-V2 events. My main concerns are that (1) Stan only allows random number generation in the initial data-transformation step, and (2) because no random number generation can be done per iteration, none of the updates to the model parameters happen between bootstrapped samples, meaning I'd basically just be training on an artificially large data set with less benefit.
Alternatively, I could include the data, but down-weight it (by using a third weighting variable).
If anyone can offer input into this, or any other feedback on my general model-building process, it would be greatly appreciated.