Data Linkage and Identification

The NetHealth Project collected data in a variety of ways and from a variety of sources:

    • Communication data from smartphones
    • Fitbit data from Fitbit devices
    • Basic Survey and Network Survey data from Qualtrics online surveys
    • Grade and course data from the ND Registrar’s office

Study participants were primarily identified and tracked through their Notre Dame NetId email address. The Registrar supplied us with a list of all students in the incoming 2015 cohort along with their NetId email addresses. We corresponded with students through this email address and used it in our internal tracking files to keep track of human subject payments and of changes in phone numbers and email addresses.

The study collected data not only from our participants but also, through them, about people in their social networks. This was done in two ways:

    1. Through the collection of communication data from smartphones. When study participants communicated with others (through texts, calls, and messages), their phone records contain the phone number and/or email address of the other party.
    2. Through the Network Surveys, in which participants were asked to tell us about people in their social network. From these surveys we obtained the names, phone numbers, and email addresses of the people in a participant’s network.

As a result, we have numerous identifiers in various datasets that reference either a study participant or someone a study participant communicated with or listed as someone in their network:

    • Notre Dame NetId Emails
    • Other email addresses 
    • Phone numbers
    • Fitbit identification codes
    • First and last names   

Because we asked study participants in the Network Surveys to provide the phone numbers and email addresses of the people they named, we can link communication data involving people outside the study to the Network Survey data provided by the participant who communicated with that person. However, some records in the communication data involve a study participant and another individual who was not named by that participant or by any other NetHealth participant; as such, we have no Network Survey information on these individuals. For some of these records, we were able to identify the person connected to a phone number or email address because we had a list of all incoming (2015) first-year students with their email addresses. For most of these records (i.e., those of non-participants about whom no data was provided in the Network Surveys), however, the phone number or email address cannot be linked to a person. We still retain information on our participants’ communications with these non-identifiable individuals.

To make all of this data accessible so that researchers can link records across data sets while preserving anonymity and without releasing identifying information (names, phone numbers, email addresses), we devised and implemented a rigorous and involved system for clustering the identifiers in the data. The objective of this system is to group the various identifiers (email addresses, phone numbers, Fitbit device identifiers, names) into clusters such that the likelihood that all identifiers in a cluster belong to one specific individual is as high as possible. We then assign each cluster a random, non-identifying integer case number and substitute this case number, which references a specific person, for all identifiers (names, addresses, etc.) in the raw data. The case number is the same in all records in which the various identifiers were found to cluster together as the same person, effectively linking data records across data sources. Detailed information on this system can be found in the Communication Data Cleaning and NetHealth Network Survey Cleaning (Section – Record Linkage) documents under Supplemental Material.
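
The general idea of clustering identifiers and replacing them with a case number can be illustrated with a short Python sketch. This is not the project's actual procedure, which is likelihood-based and documented in the Supplemental Material. As a simplifying assumption, the sketch treats any identifiers that appear together in one raw record as belonging to the same person and merges them with a union-find structure. The function names, the sample identifiers, and the choice of a 5-digit range are all hypothetical.

```python
import random


def cluster_identifiers(records):
    """Group identifiers that co-occur in any record (union-find).

    Simplifying assumption: co-occurrence in one record means "same person".
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for identifiers in records:
        identifiers = list(identifiers)
        if not identifiers:
            continue
        find(identifiers[0])  # register identifiers that appear alone
        for other in identifiers[1:]:
            union(identifiers[0], other)

    clusters = {}
    for ident in list(parent):
        clusters.setdefault(find(ident), set()).add(ident)
    return list(clusters.values())


def assign_case_numbers(clusters, rng, low=10000, high=99999):
    """Map every identifier to a random case number unique to its cluster.

    The 5-digit default range is only an example; in the released data the
    number of digits depends on the type of person.
    """
    numbers = rng.sample(range(low, high + 1), len(clusters))
    return {
        ident: number
        for cluster, number in zip(clusters, numbers)
        for ident in cluster
    }


# Made-up identifiers for illustration only.
raw_records = [
    {"jdoe@example.edu", "574-555-0101"},
    {"574-555-0101", "fitbit:ABC123"},
    {"asmith@example.edu"},
]
clusters = cluster_identifiers(raw_records)
case_numbers = assign_case_numbers(clusters, random.Random(0))
```

In this example the email address, the phone number, and the Fitbit code end up in one cluster and share one case number, while the second email address receives a different one.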

Table 2: Case Record Codes (Vertex Codes) details the meaning of this case number. In brief, the number of digits indicates the type of entity:

    • A 4-digit number indicates a text or call to or from an entity that, as far as we can tell, is not a person. These records are excluded from the communication event data that we are releasing.
    • A 5-digit number indicates a person who was at some point officially a NetHealth Study participant.  
    • A 6-digit number indicates a person named by a study participant in a Network Survey.
    • A 7-digit number indicates a person who was neither named in a Network Survey nor a study participant but for whom we have a legitimate phone number or email address. The first digit of the 7-digit code provides some additional information about these “unknown” persons (see Table 2: Case Record Codes): if the first digit is a “1”, the person’s email address was on the list of emails of all Class of 2019 students that we obtained from the Registrar in 2015, indicating that they were an ND student.
    • An 8-digit number indicates a questionable phone number value. These records are excluded from the released communication data.
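
A small helper along the following lines can decode a case number when working with the released files. It is a minimal sketch that assumes case numbers are stored as positive integers without leading zeros; the function names and category labels are our own shorthand for illustration, not part of the released data. Only the first-digit rule described above is encoded; other first-digit values are left to Table 2.

```python
# Category labels are hypothetical shorthand for the list above.
CASE_TYPES = {
    4: "non-person entity (excluded from release)",
    5: "study participant",
    6: "named alter",
    7: "unknown person with valid phone number or email",
    8: "questionable phone number (excluded from release)",
}


def case_type(case_number):
    """Return the category implied by the number of digits."""
    n_digits = len(str(int(case_number)))
    if n_digits not in CASE_TYPES:
        raise ValueError(f"unexpected case number length: {n_digits}")
    return CASE_TYPES[n_digits]


def on_registrar_list(case_number):
    """True for a 7-digit code whose first digit is 1."""
    digits = str(int(case_number))
    return len(digits) == 7 and digits[0] == "1"
```

For example, case_type(12345) returns "study participant", and on_registrar_list(1234567) returns True.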

This non-identifying case number identifies all records in all of the data sets and replaces all the other identifiers that were initially used to identify records:

    • Survey Data: Because this information comes only from study participants, all records are identified by a 5-digit case number.
    • Fitbit Data: Because this information comes only from study participants, all records are identified by a 5-digit case number.
    • Course and Grades Data: Because this information was obtained from the ND Registrar only for participants who consented, all case numbers have 5 digits.
    • Network Survey Data: This is relational data in which each record contains two case numbers, one for the person (ego) who completed the survey and named people (alters) in their network, and one for the alter that the ego named. All ego case numbers must have 5 digits. Alter case numbers must have either 5 digits (if the ego named another study participant) or 6 or 7 digits (if the ego named a non-participant).
    • Network Survey Data, alter-alter list: In the Network Survey, participants (egos) were asked, for each pair of their alters, whether the two alters knew each other. We release this data as an edge list in which each case number can refer to either a participant (5-digit) or a named alter (6-digit).
    • Communication Event Data: This is relational data in which each record contains two case numbers, one for the person whose device recorded the communication event and one for the person with whom they were communicating. The device that recorded the event must be owned by a study participant, so all the case numbers for the person from whom we received the data have 5 digits. The other person involved in the communication could be of any type: a study participant (5-digit), a named alter (6-digit), or an unknown person with a valid email address or phone number (7-digit).
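
The digit-length rules for the three relational data sets can be used as a quick sanity check after loading the files. The sketch below simply transcribes the rules from the list above; the data set labels and the assumption that each record is a pair of integer case numbers are ours, and the released file layouts may differ.

```python
# Allowed case-number lengths (first column, second column) per data set,
# transcribed from the list above. The dictionary keys are hypothetical labels.
ALLOWED_DIGITS = {
    "network_survey": ({5}, {5, 6, 7}),  # ego, alter
    "alter_alter": ({5, 6}, {5, 6}),     # alter, alter
    "communication": ({5}, {5, 6, 7}),   # device owner, other party
}


def n_digits(case_number):
    return len(str(int(case_number)))


def invalid_records(dataset, records):
    """Return the (first, second) pairs that break the digit-length rules."""
    first_ok, second_ok = ALLOWED_DIGITS[dataset]
    return [
        (a, b)
        for a, b in records
        if n_digits(a) not in first_ok or n_digits(b) not in second_ok
    ]
```

An empty result means every record is consistent with the rules; any pair returned deserves a closer look before analysis.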

This case numbering system allows researchers to link data across different sources while preserving the anonymity of the people about whom we have information. Researchers can link across data sets to merge information on study participants obtained from the basic surveys, Fitbit devices, and communication events. The system also lets researchers link information across data sets (and waves) on the alters whom study participants named. If a named alter is also a study participant, the alter will have a 5-digit code in the Network Survey data, which can be used to link to their basic survey and communication event data. In addition, information on alters who are named in more than one Network Survey (either by other study participants or by the same participant at different times) is linked.
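
As an illustration of this kind of linkage, the following sketch attaches basic survey rows to both parties of each communication event by matching on the case number. The field names (case_id, ego_id, alter_id, q1) and the sample values are hypothetical; the released files may use different variable names.

```python
def link_events_to_surveys(events, surveys):
    """Attach survey rows to both sides of each communication event.

    A party without survey data (for example a 6- or 7-digit case number)
    gets None, since survey data exists only for study participants.
    """
    survey_by_case = {row["case_id"]: row for row in surveys}
    linked = []
    for event in events:
        linked.append({
            **event,
            "ego_survey": survey_by_case.get(event["ego_id"]),
            "alter_survey": survey_by_case.get(event["alter_id"]),
        })
    return linked


# Made-up records for illustration only.
events = [
    {"ego_id": 11111, "alter_id": 22222, "type": "text"},
    {"ego_id": 11111, "alter_id": 333333, "type": "call"},
]
surveys = [
    {"case_id": 11111, "q1": 3},
    {"case_id": 22222, "q1": 5},
]
linked = link_events_to_surveys(events, surveys)
```

In the first event both parties are study participants, so both survey rows are attached. In the second event the other party has a 6-digit code, so only the ego's survey row is available.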