Identifiability of Research Data

Accurately describing the identifiability of research data and biospecimens throughout the life cycle of a study protocol is key to facilitating Institutional Review Board (IRB) review. The way data and/or specimens are generated, collected (or recorded), and stored has ramifications for whether a research proposal meets the:

Unfortunately, the terms typically used in study protocols to describe how data will be maintained – de-identified, coded, anonymized, and anonymous – are often used interchangeably, even though they do not all have the same meaning. This leaves the IRB unsure of what the data look like and how they will be protected, which can result in additional, and potentially frustrating, back-and-forth with the study team.

Identifiable Data & Specimens

While the federal regulations that govern human subject research refer to ‘identifiable private information’, ‘identifiable biospecimen’, and ‘identifiers’, they do not specifically define data elements that could be used to identify subjects. As such, IRBs routinely defer to the 18 identifiers described in the Health Insurance Portability & Accountability Act (HIPAA) regardless of whether the research contains health information or is conducted within a covered entity.

List of identifiers
  1. Names,
  2. Geographic subdivisions smaller than a State, including: street address, city, county, precinct, and zip code and their equivalent geocodes (except for the initial three digits of a zip code, if according to the current publicly-available data from the Bureau of the Census: [1] the geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and [2] the initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000),
  3. All elements of dates (except year) for dates directly related to an individual, including (but not limited to): birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of 90 or older,
  4. Telephone numbers,
  5. Fax numbers,
  6. E-mail addresses,
  7. Social security numbers,
  8. Medical record numbers,
  9. Health plan beneficiary numbers,
  10. Account numbers,
  11. Certificate/license numbers,
  12. Vehicle identifiers and serial numbers, including license plate numbers,
  13. Device identifiers and serial numbers,
  14. Web universal resource locators (URLs),
  15. Internet protocol (IP) address numbers,
  16. Biometric identifiers, including voice and fingerprints,
  17. Full-face photographic images and any comparable images, and
  18. Any other unique identifying number, characteristic, or code.
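
Items 2 and 3 in the list describe generalization rules rather than simple removal, so a short sketch may help. The Python below is illustrative only: the function names are hypothetical, and `LOW_POPULATION_ZIP3` is a placeholder for the list of three-digit zip prefixes covering 20,000 or fewer people, which would need to be built from current Census data.

```python
from datetime import date

# Placeholder set of three-digit zip prefixes whose combined population is
# 20,000 or fewer. These values are illustrative assumptions; a real list
# must be derived from current Bureau of the Census data.
LOW_POPULATION_ZIP3 = {"036", "059"}


def generalize_zip(zip_code: str) -> str:
    """Keep the first three digits, or '000' for low-population prefixes."""
    zip3 = zip_code.strip()[:3]
    return "000" if zip3 in LOW_POPULATION_ZIP3 else zip3


def generalize_age(age: int) -> str:
    """Report all ages over 89 as a single '90 or older' category."""
    return "90 or older" if age >= 90 else str(age)


def age_on(birth: date, as_of: date) -> int:
    """Age in whole years on the given date."""
    years = as_of.year - birth.year
    if (as_of.month, as_of.day) < (birth.month, birth.day):
        years -= 1  # birthday has not yet occurred in the as_of year
    return years


def generalize_birth_date(birth: date, as_of: date) -> str:
    """Keep only the birth year, unless the year itself would reveal an age over 89."""
    if age_on(birth, as_of) >= 90:
        return "90 or older"
    return str(birth.year)
```

Applying functions like these addresses only the zip code and date elements; the other 16 identifiers still need to be handled separately.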

Key Terms

Coded Data & Specimens

‘Coded’ means that information that might readily identify a subject has been replaced with a number, letter, symbol, or combination thereof (i.e., a code) and there is a separate key linking the code to the subject’s identifiable information.

For example, if you assign John Doe the code ‘01’ and use ‘01’ to identify the subject within the data set and/or label the subject’s specimen, while simultaneously maintaining a separate key that identifies ‘01’ as John Doe, the data/specimen are considered coded.

The existence of the key means that the data/specimens could be linked back to the individual and therefore makes the data/specimens identifiable, albeit indirectly, to whoever has access to the key. Therefore, the term ‘coded’ is not synonymous with de-identified, anonymized, or anonymous, because the information is identifiable to study team members with access to the key.

Coding is often used when a study team needs to collect identifiable information in order to meet the aims of the research (e.g., they need to link subject data over time or they need to link subject data between multiple sources); coding is used as a means of maintaining confidentiality.
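
As a minimal sketch of what coding can look like in practice, the Python below splits a set of records into a coded dataset and a separate key, mirroring the John Doe example above. The function name `code_records`, the field names, and the sample records are all hypothetical; how and where the key is stored and secured is outside the scope of this sketch.

```python
def code_records(records, identifier_fields):
    """Split records into a coded dataset and a separate linking key."""
    coded_rows = []
    key = {}
    for i, record in enumerate(records, start=1):
        code = f"{i:02d}"  # sequential codes such as '01', '02', ...
        # The key holds the identifiers and is meant to be stored separately.
        key[code] = {f: record[f] for f in identifier_fields if f in record}
        # The coded row keeps only the code and the non-identifying fields.
        row = {"code": code}
        row.update({f: v for f, v in record.items() if f not in identifier_fields})
        coded_rows.append(row)
    return coded_rows, key


# Hypothetical sample data for illustration only.
records = [
    {"name": "John Doe", "email": "jdoe@example.org", "score": 17},
    {"name": "Jane Roe", "email": "jroe@example.org", "score": 22},
]
coded_rows, key = code_records(records, identifier_fields={"name", "email"})
```

Because `key` still links each code to a person, the output remains coded (indirectly identifiable) rather than de-identified for anyone who can access the key.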

De-Identified/Anonymized Data & Specimens

‘De-identified’ means the data/specimen cannot be related or attributed to a specific individual, either directly or indirectly, which means there is no reason to believe the data/specimen could be used to identify an individual. In other words, the dataset does not include any of the identifiers listed above, nor are data/specimens affiliated or linked with any of the identifiers listed above.

De-identified data/specimens are often used in retrospective and secondary analysis research. For example, if a study team conducts a review of existing records (e.g., academic/medical records) and only records the research data, without recording any information that directly or indirectly identifies the subjects (i.e., any of the identifiers listed above), the dataset would be considered de-identified. In this instance, the investigator may have access to or view identifiers during the data collection process but does not record, reference, collect, or link identifiers in the data set.

Another common scenario is for a study team to obtain data/specimens from a publicly available dataset, a tissue/data bank, or from previously conducted research. Whether these data/specimens are considered de-identified hinges on the study team’s access to the original source data/specimens (i.e., the original dataset/bank/research) and how the data/specimens for the current research are provided to the study team.

  • If the study team does not have access to the original source data/documents and the data/specimens are provided by a ‘gatekeeper’ in a manner that is neither directly nor indirectly identifiable (i.e., does not include any of the identifiers listed above), the data/specimen would be considered de-identified.
  • If the study team was involved in the original data collection, the gatekeeper engages in the current research study, or the source data/specimens are identifiable, the data/specimens would continue to be considered identifiable. Anytime any member of the study team has access to identifiable source data/specimens (even if stored separately), the secondary use of the data/specimens would be considered identifiable. A common example of this is conducting secondary research on a dataset pulled from prior research. Even if only one member of the study team has access to the original, identifiable (or coded) data set, the dataset used for the secondary analysis would still be considered identifiable.

Study teams will also often de-identify a previously coded or directly identifiable dataset once the research is complete, as a means of further protecting the data. That is, all identifiers listed above are removed from the data/specimens and any code, key, or link that previously identified the data/specimens is destroyed. Data/specimens handled this way are also often referred to as ‘anonymized’.

To reiterate, to be considered de-identified or anonymized:

  • the data/specimens collected cannot include any of the identifiers listed above,
  • links or keys that indirectly identify the data/specimens cannot exist or be accessible to any study team member, and
  • the data/specimens cannot be re-identified.
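
The first two criteria above lend themselves to a simple self-check before describing a dataset as de-identified. The Python below is a rough sketch under stated assumptions: `IDENTIFIER_COLUMNS` is a hypothetical set of column names, not an official list, and the function cannot assess the third criterion (whether the data could be re-identified), which still requires judgment.

```python
# Hypothetical column names a team might use for identifiers from the list
# above; a real project would tailor this set to its own data dictionary.
IDENTIFIER_COLUMNS = {
    "name", "street_address", "zip_code", "birth_date", "phone", "fax",
    "email", "ssn", "mrn", "ip_address",
}


def deidentification_problems(columns, key_exists, team_can_access_key):
    """Return reasons a dataset fails the first two criteria (empty if none found)."""
    problems = []
    found = sorted(IDENTIFIER_COLUMNS & {c.lower() for c in columns})
    if found:
        problems.append("identifier columns present: " + ", ".join(found))
    if key_exists:
        problems.append("a link or key to identifiers still exists")
    if team_can_access_key:
        problems.append("a study team member can access a link or key")
    return problems
```

An empty result does not establish that a dataset is de-identified; it only means none of the listed problems were detected.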

Anonymous Data & Specimens

‘Anonymous’ means the data/specimens were collected without identifiers and a code/key linking the data/specimens to identifiable information never existed (so there is no way for the data/specimen to be linked back to a specific individual). For example, data pulled from a publicly available database, as well as one-time surveys and specimen collections where no identifiers are ever collected or recorded, would be considered anonymous. One-time interviews and focus groups might also be considered anonymous provided no identifiers are recorded, only pseudonyms are used (in place of names), and they are not audio-recorded.

Tips for facilitating IRB review

  • Carefully consider whether you need to collect identifiers to achieve the aims of your research.
  • Provide consistent information throughout your protocol and all supporting IRB submission materials, including the Human Subject Research Electronic Data Security Assessment Form.
  • Do not cut and paste from previously approved protocols and supporting IRB submission materials; doing so increases the likelihood of providing inconsistent information.
  • Do not inaccurately indicate that the data will be de-identified or limit yourself to only collecting de-identified data under the (false!) assumption that doing so will make IRB review easier. Collect the data you need in order to meet the aims of your research, but do so with a plan to adequately protect the data/specimens.
  • Be aware that the identifiability of the data/specimens collected may change over the course of the study (e.g., once data collection is complete) and accurately providing this clarification within the study protocol will aid the IRB review process.
  • Carefully consider context and access while communicating about the identifiability of your data/specimens. Identifiability may vary based on a team member's or site’s role in the research. For example, an investigator or study coordinator may have access to identifiable information whereas a statistician may only have access to de-identified information. Similarly, a local study team may have access to identifiable information but may only share de-identified data with an external collaborating site. Clearly distinguishing access can play a critical role in both the IRB review process and any required research-related agreements. For the purposes of IRB review, the IRB needs to make determinations from a global viewpoint, inclusive of all engaged roles/sites under its review, whereas research-related agreements (e.g., data use or material transfer agreements) may be executed for more granular components of the research. Using appropriate terminology in consideration of roles/sites and context will further facilitate the process.