De-identify machine extraction data
fileDeID.Rd
This function is used to de-identify a flat file using a crosswalk (with
specific structure, see loadxwalks()
).
De-identification options include replacing a patient MRN by a tokenized MRN
(using the crosswalk), removing a column, 'blanking' a column
(the column remains in the dataset but all values are ""), and shifting a
date or datetime variable (using the crosswalk).
Usage
fileDeID(
filetodeid,
fd_varname_mrn = "PAT_MRN",
variablestoremove = character(0),
variablestoblank = character(0),
datevariablestodateshift = character(0),
dateformat = "%Y-%m-%d",
datetimevariablestodateshift = character(0),
datetimeformat = "%Y-%m-%d %H:%M:%OS",
separator = ",",
separator_out = ",",
xwalk,
compare_mrn_numeric = attr(xwalk, "compare_mrn_numeric"),
outputfile = NULL,
usefread = TRUE,
verbose = 0L
)
Arguments
- filetodeid
character. Filename of a single flat file to de-identify.
- fd_varname_mrn
character. Name of MRN variable in file to de-identify. Default is "PAT_MRN".
- variablestoremove
character vector. Names of variables in extraction file to remove. Default is
character(0)
, no variables to remove.- variablestoblank
character vector. Names of variables in extraction file to blank. The variable will remain in the output file but will be ” for all rows. Default is
character(0)
, no variables to blank.- datevariablestodateshift
character vector. Names of variables that are dates and should be date-shifted. Default is
character(0)
, no variables to shift.- dateformat
character. Format of date variables in the extraction file. Default is "%Y-%m-%d" corresponding to 4-digit year, hyphen, 2-digit month, hyphen, 2-digit day.
- datetimevariablestodateshift
character vector. Names of variables that are datetimes and should be date-shifted. Default is
character(0)
, no variables to shift.- datetimeformat
character. Format of datetime variables in the extraction file. Default is "%Y-%m-%d %H:%M:%OS" corresponding to 4-digit year, hyphen, 2-digit month, hyphen, 2-digit day, space, 2-digit (24) hour, colon, 2-digit minute, colon 2-digit second.
- separator
character. Field separator in filetodeid (input file). Default is
","
.- separator_out
character. Field separator to use in outputfile. Default is
","
.- xwalk
data.frame containing crosswalk information. Usually the output from
loadxwalks()
.- compare_mrn_numeric
logical. Should MRNs be compared as numeric variables? Usually this is a good idea because leading 0s may have been dropped during processing. Default is whatever was used to create xwalk, which is
TRUE
by default.- outputfile
character. Name of file to write de-identified data. If
NULL
(the default) the data are not written, but only returned from the function. An additional option is the special value"SOURCE_"
, which causes the output to be written to the same filename as the input but prepended with "SOURCE_".- usefread
If
TRUE
(default), usedata.table::fread()
anddata.table::fwrite()
. IfFALSE
, useutils::read.csv()
andutils::write.csv()
.TRUE
is usually preferable asFALSE
results in double quotes around almost all values when producing the output file.- verbose
integer. Higher values produce more output to console. Default is 0, no output.
Value
(invisibly) data.frame (even if usefread == TRUE
) with variables tokenized, date-shifted, removed, and/or blanked, as requested.
Examples
dataloc <- system.file("extdata", package = "DeIDmachinedata")
fn1 <- sprintf("%s/xwalk1.csv", dataloc)
fn2 <- sprintf("%s/xwalk2.csv", dataloc)
xwalk <- loadxwalks(tokenfile = fn1, dateshiftfile = fn2)
fn3 <- sprintf("%s/pentacam_UCH.csv", dataloc)
deidfile <- fileDeID(
filetodeid = fn3,
fd_varname_mrn = "Pat-ID:",
variablestoremove = c("Last Name:", "First Name:", "D.o.Birth:"),
variablestoblank = "Exam Comment:",
datevariablestodateshift = "Exam Date:",
dateformat = "%m/%d/%Y",
xwalk = xwalk,
outputfile = NULL,
verbose = 2)
#> Processing /home/runner/work/_temp/Library/DeIDmachinedata/extdata/pentacam_UCH.csv
#> /home/runner/work/_temp/Library/DeIDmachinedata/extdata/pentacam_UCH.csv Variable Names
#> [1] "Last Name:" "First Name:" "Pat-ID:" "D.o.Birth:"
#> [5] "Exam Date:" "Exam Time:" "Exam Eye:" "Exam Type:"
#> [9] "Exam Comment:" "Status" "Error" "Rf F (mm):"
#> [13] "Rs F (mm):" "Rh F (mm):" "Rv F (mm):" "K1 F (D):"
#> [17] "K2 F (D):" "Rm F (mm):" "Km F (D):" "Axis F (flat):"
#> [21] "Astig F (D):" "R Per F (mm)" "R Min (mm)"
#> 3 rows read from /home/runner/work/_temp/Library/DeIDmachinedata/extdata/pentacam_UCH.csv
#> Pat-ID: Exam Date: Exam Time: Exam Eye: Exam Type: Exam Comment:
#> 1 2 1/2/2020 123 OS penta PII
#> 2 2 1/31/2020 234 OS penta PII
#> 3 9 12/31/1999 321 OD penta PII
#> Status Error Rf F (mm): Rs F (mm): Rh F (mm): Rv F (mm):
#> 1 Good None 9.3 <NA> 4.9 <NA>
#> 2 Good, Really Good! None 9.3 8.7 4.1 6.3
#> 3 Bad Yes 9 8.7 <NA> 6.3
#> K1 F (D): K2 F (D): Rm F (mm): Km F (D): Axis F (flat): Astig F (D):
#> 1 50 60 8.1 35 75 10
#> 2 40 50 8 30 80 30
#> 3 0 0 7.9 10 20 30
#> R Per F (mm) R Min (mm)
#> 1 5 9
#> 2 5.5 10
#> 3 1.1 6
#> Summary of MRN matching index
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 2 2 2 2 2 2 1
#> Number of patients with each number of test (original MRN)
#>
#> 1 2
#> 1 1
#> Number of patients with each number of test (tokenized MRN)
#>
#> 1 2
#> 1 1
#> 1 tokenized mrns missing of 3
#> Shifting Exam Date:
#> [1] "1/2/2020" "1/31/2020" "12/31/1999"
#> [1] "2020-01-06" "2020-02-04" NA
deidfile # Note last test was on a person not in the crosswalk
#> Pat-ID: Exam Date: Exam Time: Exam Eye: Exam Type: Exam Comment:
#> 1 b 2020-01-06 123 OS penta <NA>
#> 2 b 2020-02-04 234 OS penta <NA>
#> 3 <NA> <NA> 321 OD penta <NA>
#> Status Error Rf F (mm): Rs F (mm): Rh F (mm): Rv F (mm):
#> 1 Good None 9.3 <NA> 4.9 <NA>
#> 2 Good, Really Good! None 9.3 8.7 4.1 6.3
#> 3 Bad Yes 9 8.7 <NA> 6.3
#> K1 F (D): K2 F (D): Rm F (mm): Km F (D): Axis F (flat): Astig F (D):
#> 1 50 60 8.1 35 75 10
#> 2 40 50 8 30 80 30
#> 3 0 0 7.9 10 20 30
#> R Per F (mm) R Min (mm)
#> 1 5 9
#> 2 5.5 10
#> 3 1.1 6