Title: | Draw Stratified Samples from the VADIR Database |
---|---|
Description: | Affords researchers the ability to draw stratified samples from the U.S. Department of Veterans Affairs/Department of Defense Identity Repository (VADIR) database according to a variety of population characteristics. The VADIR database contains information for all veterans who were separated from the military after 1980. The central utility of the present package is to integrate data cleaning and formatting for the VADIR database with the stratification methods described by Mahto (2019) <https://CRAN.R-project.org/package=splitstackshape>. Data from VADIR are not provided as part of this package. |
Authors: | Trevor Swanson [aut, cre], Kelsie Forbush [aut], Joanna Wiese [ctb], Melinda Gaddy [ctb], Mary Oehlert [ctb] |
Maintainer: | Trevor Swanson <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.0.9000 |
Built: | 2025-03-13 03:15:47 UTC |
Source: | https://github.com/tswanson222/samplevadir |
Identifies whether a new version of VADIR contains any responses that already appeared in an old version. Can also automatically remove those repeated responses.
checkData(old, new, fix = FALSE, dates = FALSE)
old |
Past version of VADIR |
new |
New version of VADIR |
fix |
Logical. Determines whether to automatically remove repeated responses. |
dates |
Logical. Determines whether to include date variables when
comparing datasets. Recommended to keep |
Returns a message that no repeated responses exist if there are none. Otherwise, returns either a warning that repeated responses exist, or returns the new VADIR dataset without repeated responses if fix = TRUE.
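The comparison described above can be illustrated with a small base R sketch. It is not the package's implementation: the toy data frames, the id column, and the row_key helper are assumptions made for illustration, the sketch prints messages rather than raising a warning, and it ignores the dates option.

```r
# Toy stand-ins for two VADIR pulls; the columns are hypothetical.
old <- data.frame(id = c(1, 2, 3), RANK_CD = c("PVT", "SGT", "CPT"),
                  stringsAsFactors = FALSE)
new <- data.frame(id = c(3, 4, 5), RANK_CD = c("CPT", "PVT", "SGT"),
                  stringsAsFactors = FALSE)

# Build one key per row by pasting all columns together, then flag
# rows of `new` whose key already appears in `old`.
row_key <- function(df) do.call(paste, c(df, sep = "\r"))
repeated <- row_key(new) %in% row_key(old)

if (!any(repeated)) {
  message("No repeated responses found.")
} else {
  message(sum(repeated), " repeated response(s) found in the new data.")
}

# Analogue of fix = TRUE: keep only the rows that are not repeats.
new_fixed <- new[!repeated, , drop = FALSE]
```

Keying on whole rows treats two responses as repeats only when every column agrees, which is one plausible reading of "repeated"; a real check might instead compare a subset of identifying columns.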
If there are known typos, the correct values of those incorrect responses can be provided and fixed across the dataset.
fixTypos(data, old, new = NULL, var = "RANK_CD")
data |
VADIR dataset |
old |
Character vector containing typos |
new |
Character vector in the same order as |
var |
Variable name for which typos should be corrected |
VADIR dataset with typos corrected
data <- fixTypos(data = VADIR_fake, old = c('CW02', 'CW0-2', 'PV1'), new = c('CWO2', 'CWO2', 'PVT'), var = 'RANK_CD')
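For readers without access to the package, the same kind of correction can be sketched in base R with a named lookup vector. This is an assumption about one reasonable approach, not a description of how fixTypos works internally; the toy data frame dat is made up, and only the typo and correction strings are taken from the example above.

```r
# Toy data; the misspelled codes mirror those in the example above.
dat <- data.frame(RANK_CD = c("CW02", "SGT", "CW0-2", "PV1", "CWO2"),
                  stringsAsFactors = FALSE)
old <- c("CW02", "CW0-2", "PV1")
new <- c("CWO2", "CWO2", "PVT")

# Named lookup: names are the typos, values are the corrections.
fixes <- setNames(new, old)

# Replace only the entries that are known typos; leave the rest alone.
hit <- dat$RANK_CD %in% old
dat$RANK_CD[hit] <- unname(fixes[dat$RANK_CD[hit]])
```

Because old and new are paired by position, they must have the same length and order.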
Imports a data file. Automatically detects the file type and applies the appropriate import function.
getData(filename, filetype = "csv", fixDates = FALSE, ...)
filename |
Character string specifying the path to desired datafile |
filetype |
Character string indicating filetype. Useful if no file
extension is provided in |
fixDates |
Logical. Determines whether to adjust date format. |
... |
Additional arguments |
Imported datafile
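Extension-based dispatch of this kind can be sketched in base R as follows. The helper read_any is hypothetical and is not part of the package; which file types getData actually supports is not stated here, so the csv and rds branches are assumptions chosen only to make the sketch runnable.

```r
# Hypothetical helper: pick a reader from the file extension, falling
# back to `filetype` when the file name has no extension.
read_any <- function(filename, filetype = "csv", ...) {
  ext <- tolower(tools::file_ext(filename))
  if (!nzchar(ext)) ext <- tolower(filetype)
  switch(ext,
    csv = utils::read.csv(filename, stringsAsFactors = FALSE, ...),
    rds = readRDS(filename),
    stop("Unsupported file type: ", ext)
  )
}

# Round trip through a temporary CSV file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, RANK_CD = c("PVT", "SGT", "CPT")),
          tmp, row.names = FALSE)
dat <- read_any(tmp)
unlink(tmp)
```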
Serves as a key for relating certain military rank designations with pay grades. Used in the sampleVADIR function for stratifying based on pay grade rather than rank.
rankDat
A data frame with six variables that links pay grades to military ranks within each military branch. PayGrade indicates the pay grade associated with a specific job title (Title) within a given Branch of the military. Title designates the job title, where Initials is the shorthand for each title (this is how the RANK_CD variable is coded in the VADIR dataset). Branch designates the military branch, where "N" stands for Navy, "A" stands for Army, "M" stands for Marines, and "F" stands for Air Force. PayCat4 represents one coding scheme that categorizes different pay grades into four categories, where "E" stands for enlisted, "NCO" stands for non-commissioned officer, "W" stands for warrant officer, and "O" stands for commissioned officer. PayCat7 represents an alternative categorization that breaks pay grades into seven categories, wherein "SNCO" stands for senior non-commissioned officer, "FGO" stands for field grade officer, "CGO" stands for company grade officer, and "GO" stands for general officer.
The sampleVADIR function uses these data by indexing the values of the RANK_CD variable of the VADIR dataset against the Initials variable in the present dataset. Each RANK_CD value is then replaced with the associated value in either the PayCat4 or PayCat7 variable, depending on what is specified in the sampleVADIR function. The purpose of this is to make the RANK_CD variable more amenable to stratification, given the difficulty of stratifying across values of a categorical variable with so many unique values.
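The indexing step described above can be sketched with base R's match(). The lookup table below borrows the column names Initials and PayCat4 from rankDat, but its rows are illustrative toy values rather than rows copied from the package data, and the sketch ignores Branch, which a real lookup would probably also need to take into account.

```r
# Toy lookup table with two of the rankDat column names; the rows are
# illustrative only and are not taken from the package data.
rank_key <- data.frame(
  Initials = c("PVT", "SGT", "CWO2", "CPT"),
  PayCat4  = c("E", "NCO", "W", "O"),
  stringsAsFactors = FALSE
)

vadir <- data.frame(RANK_CD = c("SGT", "PVT", "CPT", "SGT", "CWO2"),
                    stringsAsFactors = FALSE)

# match() returns, for each RANK_CD, its row position in Initials;
# indexing PayCat4 by that position swaps rank for pay category.
idx <- match(vadir$RANK_CD, rank_key$Initials)
vadir$RANK_CD <- rank_key$PayCat4[idx]

# Five rank codes collapse into four categories.
table(vadir$RANK_CD)
```

Any RANK_CD value with no match in the lookup table becomes NA under this approach, which makes unmatched or misspelled codes easy to spot before stratifying.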
Core function used to pull a stratified sample from VADIR based on a variety of parameters.
sampleVADIR(
  data,
  n = 4500,
  vars = "all",
  rankDat = "rankDat",
  payRanks = 4,
  post911 = TRUE,
  dischargedAfter = FALSE,
  until = NULL,
  ageDischarge = TRUE,
  ageEnlist = FALSE,
  ageNow = FALSE,
  yearsServed = FALSE,
  dateformat = "%m/%d/%Y",
  params = NULL,
  formats = "default",
  typos = list(),
  rmDeviates = FALSE,
  timeCats = FALSE,
  saveData = TRUE,
  onlyIDs = FALSE,
  oversample = FALSE,
  exclude = FALSE,
  seed = NULL
)
data |
VADIR dataset |
n |
Total desired sample size |
vars |
Character vector indicating which variables to use in stratification |
rankDat |
Dataset linking ranks to pay grade, or character string
indicating where to pull that dataset from. Recommended to leave as
|
payRanks |
Number of pay grades to use when converting rank variable. Only options are either 4 or 7. |
post911 |
Logical. Determines whether to only consider individuals deployed after 9/11/2001 |
dischargedAfter |
Character string indicating what date to restrict
sampling to based on discharge date. Can set to |
until |
Upper limit to when service was started. |
ageDischarge |
Logical. Determines whether to use age at discharge as a stratum. |
ageEnlist |
Logical. Determines whether to use age at enlistment as a stratum. |
ageNow |
Logical. Determines whether to use current age as a stratum. |
yearsServed |
Logical. Determines whether to use total years served as a stratum. |
dateformat |
Character string indicating the expected date format. Should be automatically detected. |
params |
Optional list of parameters to override defaults in function. Creates an easy way to interface with the function if performing the stratification multiple times. Allows the user to avoid writing the same arguments multiple times. |
formats |
Should be |
typos |
List containing typos to be fixed, as well as what they should
be changed to. Leave at |
rmDeviates |
Logical. Determines whether rows with unexpected response
values are removed. If |
timeCats |
Logical or numeric. Determines whether the time-related
variables should be treated as categorical variables. If |
saveData |
Logical. Determines whether to save the full dataset in the output. Specifically, returns the full dataset of candidates (i.e., some people may be removed from consideration due to errors or unexpected responses). |
onlyIDs |
Logical. Determines whether to only return ID values for selected individuals rather than a full dataset. |
oversample |
Logical. Determines whether to oversample or undersample based on limitations due to available proportions of strata in subsample. |
exclude |
Logical. Determines whether to exclude people missing a zip
code, as well as people with |
seed |
Numeric value indicating the seed to set for the stratification procedure. Allows for reproducible results. |
Performs stratification separately for males and females, where males and females are sampled at a 1:1 ratio, regardless of population ratio.
With a large dataset (which is typical for VADIR), setting any of the date-related variables to TRUE can drastically increase computation time. The relevant arguments include: ageDischarge, ageEnlist, ageNow, yearsServed.
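The idea of stratifying separately within each sex and drawing the two halves at a 1:1 ratio can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's procedure: the package builds on splitstackshape, whereas the sample_strata helper, the sex column, and the toy population below are all hypothetical.

```r
set.seed(19)

# Toy population: 800 males, 200 females, one stratifying variable.
pop <- data.frame(
  id     = 1:1000,
  sex    = rep(c("M", "F"), times = c(800, 200)),
  SVC_CD = sample(c("A", "N", "F", "M"), 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Draw about `n` rows from `df`, allocating to each stratum in
# proportion to that stratum's share of `df`.
sample_strata <- function(df, n, strata) {
  groups <- split(df, df[[strata]])
  picked <- lapply(groups, function(g) {
    k <- min(nrow(g), round(n * nrow(g) / nrow(df)))
    g[sample.int(nrow(g), k), , drop = FALSE]
  })
  do.call(rbind, picked)
}

n_total <- 100
# Half of the sample comes from each sex, regardless of the 4:1
# population ratio; strata are proportional within each sex.
out <- lapply(split(pop, pop$sex), sample_strata,
              n = n_total / 2, strata = "SVC_CD")

sapply(out, nrow)
```

Because each stratum's allocation is rounded independently, the per-sex totals can differ from the target by a row or two; a production routine would need a rule for reconciling that.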
A list containing the males and females who were sampled from VADIR
params <- list(
  n = 7000,
  vars = c('PN_Sex_CD', 'PN_BRTH_DT', 'SVC_CD', 'PNL_CAT_CD', 'RANK_CD',
           'PNL_TERM_DT', 'PNL_BGN_DT', 'OMB_RACE_CD',
           'OMB_ETHNC_NAT_ORIG_CD', 'POST_911_DPLY_IND_CD'),
  rankDat = 'rankDat',
  payRanks = 4,
  post911 = FALSE,
  until = NULL,
  dischargedAfter = FALSE,
  ageDischarge = TRUE,
  ageEnlist = FALSE,
  ageNow = FALSE,
  yearsServed = FALSE,
  dateformat = '%m/%d/%Y',
  formats = 'default',
  rmDeviates = FALSE,
  timeCats = TRUE,
  saveData = TRUE,
  onlyIDs = FALSE,
  oversample = TRUE,
  exclude = FALSE,
  typos = list()
)
out <- sampleVADIR(VADIR_fake, params = params, seed = 19)
Used to evaluate the representativeness of the sample with regard to the population. Males and females are evaluated separately.
testStrata(out, data = NULL, metric = cor, zeros = FALSE)
out |
Output of |
data |
Original VADIR data |
metric |
Function for measuring similarity between population and sample |
zeros |
Should empty strata be included? |
Similarity values for males and females
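One way to think about such a similarity value is as a comparison of stratum proportions in the population and in the sample. The base R sketch below illustrates that idea with cor, the default metric; it is an assumption about the general approach rather than the function's actual internals, and the toy population and sample are made up.

```r
set.seed(1)

# Toy population and a simple random sample from it.
population <- data.frame(
  SVC_CD = sample(c("A", "N", "F", "M"), 2000, replace = TRUE,
                  prob = c(0.4, 0.3, 0.2, 0.1)),
  stringsAsFactors = FALSE
)
sampled <- population[sample.int(nrow(population), 300), , drop = FALSE]

# Proportion of rows in each stratum, using shared factor levels so
# that strata missing from the sample are kept as zeros.
lev <- sort(unique(population$SVC_CD))
p_pop  <- prop.table(table(factor(population$SVC_CD, levels = lev)))
p_samp <- prop.table(table(factor(sampled$SVC_CD, levels = lev)))

# Any function of two numeric vectors could stand in for cor here.
similarity <- cor(as.numeric(p_pop), as.numeric(p_samp))
similarity
```

A value near 1 indicates that the strata appear in the sample in roughly the same proportions as in the population.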
Simulated VADIR data based solely on the variable names and appropriate response options for each. Values of variables were generated based on population proportions identified in a subsample of approximately 140,000 veterans from a version of the VADIR database obtained in 2020. However, this simulated dataset does not fully represent population characteristics of VADIR, and is simply meant as a faux tool for testing functions in the sampleVADIR package.
VADIR_fake
A data frame with ten variables, representing variables as they are formatted within the actual VADIR database.