The NetHealth Project collected data in a variety of ways and from a variety of devices: 

  • Communication data from smartphones
  • Fitbit data from Fitbit devices
  • Basic Survey and Network Survey data from Qualtrics online surveys
  • Grade and course data from the ND Registrar’s office

 

Study participants were primarily identified and tracked through their Notre Dame NetId email addresses.  The Registrar supplied us with a list of all students in the incoming 2015 cohort along with their NetId email addresses.  We corresponded with students through these addresses and used them in our internal tracking files to keep track of human subject payments and changes in phone numbers and email addresses.

 

The study collected data not only from our participants but also, through them, about people in their social networks.  This was done in two ways:

  1. Through the collection of communication data from smartphones.  When study participants communicated with others (through texts, calls, and messages), their phones recorded the phone number and/or email address of the other party.
  2. Through the Network Surveys, in which participants were asked to tell us about people in their social network.  From these surveys we obtained the names, phone numbers, and email addresses of the people in a participant’s network.

 

As a result, we have numerous identifiers in various datasets that reference either a study participant or someone a study participant communicated with or listed as being in their network:

  • Notre Dame NetId Emails
  • Other email addresses 
  • Phone numbers
  • Fitbit identification code
  • First and last names   

 

Because we asked study participants in the Network Surveys to provide the phone numbers and email addresses of the people they named, we can link communication data involving individuals who were not study participants to the network survey data provided by the participant who communicated with that person.  However, some records in the communication data involve a study participant and another individual who was not named by that participant or by any other NetHealth participant, and we therefore have no survey information on them.  For some of these records we were able to identify the person connected to a phone number or email address because we obtained a list of all incoming (2015) first-year students with their email addresses, but for most of these records the phone number or email address cannot be linked to a person.  We still retain information on our participants’ communications with these non-identifiable individuals.

 

To make all this data accessible so that researchers can link records across data sets while preserving anonymity and without releasing identifying information (names, phone numbers, email addresses), we devised and implemented a rigorous and detailed system for clustering identifiers.  Detailed information on this system can be found in the Communication Data Cleaning document under Supplemental Material.

 

Through the clustering algorithms, identifiers were clustered together so that identifiers in the same cluster have the highest probability of being linked to the same person.  Each cluster of identifiers was then assigned a random, non-identifying integer case number.  Table 2: Case Record Codes (Vertex Codes) details the meaning of this case number, which depends on its number of digits:

  • A 4-digit number indicates a text or call to or from an entity which, as far as we can tell, is not a person.  These records are excluded from the communication event data that we are releasing.
  • A 5-digit number indicates a person who was at some point officially a NetHealth Study participant.
  • A 6-digit number indicates a person named by a study participant in a Network Survey.
  • A 7-digit number indicates a person who was neither named in a Network Survey nor a study participant but for whom we have a legitimate phone number or email address.  Some additional information about these “unknown” persons is provided by the first digit of the 7-digit code (see Table 2: Case Record Codes):  if the first digit is a “1”, the person’s email address was on a list of emails we obtained in 2015 from the Registrar of all Class of 2019 students, indicating that they were an ND student.
  • An 8-digit number indicates a questionable phone number value.  These records are excluded from the released communication data.
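
The digit-length convention above can be checked mechanically.  The following is a minimal Python sketch, not part of the NetHealth release: the function name, the label strings, and the example numbers are hypothetical, and only the leading-digit rule stated above (a “1” on a 7-digit code) is encoded.  Other leading digits are described in Table 2 and are not interpreted here.

```python
def classify_case_number(case_number: int) -> str:
    """Return a (hypothetical) label for a case number based on its digit count."""
    text = str(case_number)
    digits = len(text)
    if digits == 4:
        return "non_person"          # excluded from released communication data
    if digits == 5:
        return "participant"         # NetHealth Study participant
    if digits == 6:
        return "named_alter"         # named in a Network Survey
    if digits == 7:
        # A leading "1" marks an address found on the 2015 Registrar list.
        if text[0] == "1":
            return "unknown_nd_student"
        return "unknown_valid"       # see Table 2 for other leading digits
    if digits == 8:
        return "questionable"        # excluded from released communication data
    raise ValueError(f"unexpected case number length: {digits}")


# Made-up case numbers, used only to exercise the function.
for example in (1234, 12345, 123456, 1234567, 2345678, 12345678):
    print(example, classify_case_number(example))
```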

 

This non-identifying case number is used to identify all records in all the various data sets and replaces all the other identifiers that were initially used to identify records.  

 

  • Survey Data:  Because this information comes only from study participants, all records are identified by a 5-digit case number.
  • Fitbit Data:  Because this information comes only from study participants, all records are identified by a 5-digit case number.
  • Course and Grades Data:  Because this information was obtained from the ND Registrar only for participants who consented, all case numbers have 5 digits.
  • Network Survey Data:  This is relational data in which each record contains two case codes, one for the person (ego) who completed the survey and named people (alters) in their network, and one for the alter that the ego named.  All ego case numbers must have 5 digits.  Alter case numbers must have either 5 digits (if the ego named another study participant) or 6 or 7 digits (if the ego named a non-participant).
  • Network Survey Data, alter-alter list:  In the Network Survey, participants were asked, for each pair of their alters, whether the two alters knew each other.  We release this data as an edge list in which both case numbers must have 5, 6 or 7 digits. 
  • Communication Event Data:  This is relational data in which each record contains two case numbers, one for the person whose device recorded the communication event and one for the person with whom they were communicating.  The device that recorded the event must be owned by a study participant, so all the case numbers for the person from whom we received the data have 5 digits.  The other person involved in the communication could be of any released type:  a study participant (5-digit), a named alter (6-digit), or an unknown person with a valid email address or phone number (7-digit).
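
The digit-length rules for the relational data sets listed above lend themselves to a simple consistency check.  The sketch below is illustrative only: the data set keys, the function names, and the sample edges are assumptions, and the released files may be organized differently.

```python
def n_digits(case_number: int) -> int:
    """Number of digits in a case number."""
    return len(str(case_number))


# Allowed digit lengths for (first case number, second case number),
# transcribed from the list above.  The keys are hypothetical names.
RULES = {
    "network_survey": ({5}, {5, 6, 7}),        # ego, alter
    "alter_alter": ({5, 6, 7}, {5, 6, 7}),     # alter, alter
    "communication_events": ({5}, {5, 6, 7}),  # recording device owner, other party
}


def invalid_edges(edges, dataset):
    """Return the (first, second) pairs that violate the rules for `dataset`."""
    first_ok, second_ok = RULES[dataset]
    return [
        (a, b)
        for a, b in edges
        if n_digits(a) not in first_ok or n_digits(b) not in second_ok
    ]


# Made-up edges: the second has a non-participant as the recording side,
# the third points at a 4-digit (non-person) code.
sample = [(12345, 234567), (234567, 12345), (12345, 1234)]
print(invalid_edges(sample, "communication_events"))
```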

 

This case numbering system allows researchers to link data across different sources while preserving the anonymity of the people about whom we have information.  Researchers can link across data sets to merge information on study participants obtained from the basic surveys, Fitbit devices, and communication events.  The system also lets researchers link information across data sets (and waves) on the alters whom study participants named.  If a named alter is also a study participant, the alter will have a 5-digit code in the Network Survey data, which can be used to link to their basic survey data.  In addition, information on alters who are named in more than one Network Survey (either by other study participants or by the same participant at different times) can be linked.
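
As an illustration of this kind of linkage, the following Python sketch attaches basic survey information to Network Survey rows whenever the named alter has a 5-digit (participant) case number.  The field names (`ego`, `alter`, `response`) and all values are invented for the example and do not reflect the actual variable names in the linked data.

```python
# Toy basic survey data keyed by 5-digit case number (invented values).
survey_by_case = {
    10001: {"response": "a"},
    10002: {"response": "b"},
}

# Toy Network Survey rows: one alter is a participant, one is not.
network_rows = [
    {"ego": 10001, "alter": 10002},
    {"ego": 10001, "alter": 200001},
]


def attach_alter_survey(rows, survey):
    """Copy each row and add the alter's survey record, or None if unavailable."""
    linked = []
    for row in rows:
        merged = dict(row)
        # Only participants (5-digit codes) can appear in the basic survey data.
        merged["alter_survey"] = survey.get(row["alter"])
        linked.append(merged)
    return linked


linked_rows = attach_alter_survey(network_rows, survey_by_case)
for row in linked_rows:
    print(row)
```

Rows whose alter is not a participant keep a `None` placeholder rather than being dropped, so the ego’s full list of named alters stays intact.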

 

Because linking data across sources increases the possibility that users can identify individuals, we de-link all data for the public release.  This involves generating new random case codes for the Survey Data, the Fitbit Data, the Network Survey Data, and the Communication Event Data.  With this de-linked data, researchers can explore each data set individually.  Researchers who want access to the linked data can submit a request in the Data section that details the research project and includes their consent not to share the linked data.
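
One way to picture de-linking is as an independent random recoding of case numbers within each data set.  The sketch below is an assumption-laden illustration, not the procedure actually used for the release: the function name is hypothetical, and the choice to keep the digit length of each code is our assumption.

```python
import random


def delink_mapping(case_numbers, rng):
    """Map each case number to a new random code with the same number of digits.

    Assumption: digit length is preserved; the source does not say whether
    the released codes keep it.
    """
    mapping = {}
    used = set()
    for old in sorted(set(case_numbers)):
        digits = len(str(old))
        low, high = 10 ** (digits - 1), 10 ** digits - 1
        new = rng.randint(low, high)
        while new in used:  # keep the new codes unique within the data set
            new = rng.randint(low, high)
        used.add(new)
        mapping[old] = new
    return mapping


# Each data set gets its own mapping, so codes no longer match across sets.
rng = random.Random()
survey_map = delink_mapping([10001, 10002, 10003], rng)
fitbit_map = delink_mapping([10001, 10002, 10003], rng)
print(survey_map)
print(fitbit_map)
```

For relational data the same mapping would be applied to both case number columns of a data set so that its internal structure is preserved.  In a real release a cryptographically strong source such as `secrets.SystemRandom()` would be a safer choice than the default generator, and the mappings themselves would have to be kept private.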