Skip to contents

This function is used to de-identify a flat file using a crosswalk (with specific structure, see loadxwalks()). De-identification options include replacing a patient MRN by a tokenized MRN (using the crosswalk), removing a column, 'blanking' a column (the column remains in the dataset but all values are ""), and shifting a date or datetime variable (using the crosswalk).

Usage

fileDeID(
  filetodeid,
  fd_varname_mrn = "PAT_MRN",
  variablestoremove = character(0),
  variablestoblank = character(0),
  datevariablestodateshift = character(0),
  dateformat = "%Y-%m-%d",
  datetimevariablestodateshift = character(0),
  datetimeformat = "%Y-%m-%d %H:%M:%OS",
  separator = ",",
  separator_out = ",",
  xwalk,
  compare_mrn_numeric = attr(xwalk, "compare_mrn_numeric"),
  outputfile = NULL,
  usefread = TRUE,
  verbose = 0L
)

Arguments

filetodeid

character. Filename of a single flat file to de-identify.

fd_varname_mrn

character. Name of MRN variable in file to de-identify. Default is "PAT_MRN".

variablestoremove

character vector. Names of variables in extraction file to remove. Default is character(0), no variables to remove.

variablestoblank

character vector. Names of variables in extraction file to blank. The variable will remain in the output file but will be ” for all rows. Default is character(0), no variables to blank.

datevariablestodateshift

character vector. Names of variables that are dates and should be date-shifted. Default is character(0), no variables to shift.

dateformat

character. Format of date variables in the extraction file. Default is "%Y-%m-%d" corresponding to 4-digit year, hyphen, 2-digit month, hyphen, 2-digit day.

datetimevariablestodateshift

character vector. Names of variables that are datetimes and should be date-shifted. Default is character(0), no variables to shift.

datetimeformat

character. Format of datetime variables in the extraction file. Default is "%Y-%m-%d %H:%M:%OS" corresponding to 4-digit year, hyphen, 2-digit month, hyphen, 2-digit day, space, 2-digit (24) hour, colon, 2-digit minute, colon 2-digit second.

separator

character. Field separator in filetodeid (input file). Default is ",".

separator_out

character. Field separator to use in outputfile. Default is ",".

xwalk

data.frame containing crosswalk information. Usually the output from loadxwalks().

compare_mrn_numeric

logical. Should MRNs be compared as numeric variables? Usually this is a good idea because leading 0s may have been dropped during processing. Default is whatever was used to create xwalk, which is TRUE by default.

outputfile

character. Name of file to write de-identified data. If NULL (the default) the data are not written, but only returned from the function. An additional option is the special value "SOURCE_", which causes the output to be written to the same filename as the input but prepended with "SOURCE_".

usefread

If TRUE (default), use data.table::fread() and data.table::fwrite(). If FALSE, use utils::read.csv() and utils::write.csv(). TRUE is usually preferable as FALSE results in double quotes around almost all values when producing the output file.

verbose

integer. Higher values produce more output to console. Default is 0, no output.

Value

(invisibly) data.frame (even if usefread == TRUE) with variables tokenized, date-shifted, removed, and/or blanked, as requested.

Examples

dataloc <- system.file("extdata", package = "DeIDmachinedata")
fn1 <- sprintf("%s/xwalk1.csv", dataloc)
fn2 <- sprintf("%s/xwalk2.csv", dataloc)
xwalk <- loadxwalks(tokenfile = fn1, dateshiftfile = fn2)
fn3 <- sprintf("%s/pentacam_UCH.csv", dataloc)
deidfile <- fileDeID(
  filetodeid = fn3,
  fd_varname_mrn = "Pat-ID:",
  variablestoremove = c("Last Name:", "First Name:", "D.o.Birth:"),
  variablestoblank = "Exam Comment:",
  datevariablestodateshift = "Exam Date:",
  dateformat = "%m/%d/%Y",
  xwalk = xwalk,
  outputfile = NULL,
  verbose = 2)
#> Processing /home/runner/work/_temp/Library/DeIDmachinedata/extdata/pentacam_UCH.csv
#> /home/runner/work/_temp/Library/DeIDmachinedata/extdata/pentacam_UCH.csv Variable Names
#>  [1] "Last Name:"     "First Name:"    "Pat-ID:"        "D.o.Birth:"    
#>  [5] "Exam Date:"     "Exam Time:"     "Exam Eye:"      "Exam Type:"    
#>  [9] "Exam Comment:"  "Status"         "Error"          "Rf F (mm):"    
#> [13] "Rs F (mm):"     "Rh F (mm):"     "Rv F (mm):"     "K1 F (D):"     
#> [17] "K2 F (D):"      "Rm F (mm):"     "Km F (D):"      "Axis F (flat):"
#> [21] "Astig F (D):"   "R Per F (mm)"   "R Min (mm)"    
#> 3 rows read from /home/runner/work/_temp/Library/DeIDmachinedata/extdata/pentacam_UCH.csv
#>   Pat-ID: Exam Date: Exam Time: Exam Eye: Exam Type: Exam Comment:
#> 1       2   1/2/2020        123        OS      penta           PII
#> 2       2  1/31/2020        234        OS      penta           PII
#> 3       9 12/31/1999        321        OD      penta           PII
#>               Status Error Rf F (mm): Rs F (mm): Rh F (mm): Rv F (mm):
#> 1               Good  None        9.3       <NA>        4.9       <NA>
#> 2 Good, Really Good!  None        9.3        8.7        4.1        6.3
#> 3                Bad   Yes          9        8.7       <NA>        6.3
#>   K1 F (D): K2 F (D): Rm F (mm): Km F (D): Axis F (flat): Astig F (D):
#> 1        50        60        8.1        35             75           10
#> 2        40        50          8        30             80           30
#> 3         0         0        7.9        10             20           30
#>   R Per F (mm) R Min (mm)
#> 1            5          9
#> 2          5.5         10
#> 3          1.1          6
#> Summary of MRN matching index
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>       2       2       2       2       2       2       1 
#> Number of patients with each number of test (original MRN)
#> 
#> 1 2 
#> 1 1 
#> Number of patients with each number of test (tokenized MRN)
#> 
#> 1 2 
#> 1 1 
#>      1 tokenized mrns missing of      3
#> Shifting Exam Date:
#> [1] "1/2/2020"   "1/31/2020"  "12/31/1999"
#> [1] "2020-01-06" "2020-02-04" NA          
deidfile # Note last test was on a person not in the crosswalk
#>   Pat-ID: Exam Date: Exam Time: Exam Eye: Exam Type: Exam Comment:
#> 1       b 2020-01-06        123        OS      penta          <NA>
#> 2       b 2020-02-04        234        OS      penta          <NA>
#> 3    <NA>       <NA>        321        OD      penta          <NA>
#>               Status Error Rf F (mm): Rs F (mm): Rh F (mm): Rv F (mm):
#> 1               Good  None        9.3       <NA>        4.9       <NA>
#> 2 Good, Really Good!  None        9.3        8.7        4.1        6.3
#> 3                Bad   Yes          9        8.7       <NA>        6.3
#>   K1 F (D): K2 F (D): Rm F (mm): Km F (D): Axis F (flat): Astig F (D):
#> 1        50        60        8.1        35             75           10
#> 2        40        50          8        30             80           30
#> 3         0         0        7.9        10             20           30
#>   R Per F (mm) R Min (mm)
#> 1            5          9
#> 2          5.5         10
#> 3          1.1          6