{"id":607,"date":"2020-02-17T01:53:07","date_gmt":"2020-02-17T05:53:07","guid":{"rendered":"http:\/\/sites.nd.edu\/nethealth\/?page_id=607"},"modified":"2020-03-06T17:48:19","modified_gmt":"2020-03-06T21:48:19","slug":"data-linkage-and-identification","status":"publish","type":"page","link":"https:\/\/sites.nd.edu\/nethealth\/data-linkage-and-identification\/","title":{"rendered":"Data Linkage and Identification"},"content":{"rendered":"<p><span style=\"font-weight: 400\">The NetHealth Project collected data in a variety of ways and from a variety of devices:\u00a0<\/span><\/p>\n<ul>\n<li style=\"list-style-type: none\">\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Communication data from smartphones<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Fitbit data from Fitbit devices<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Basic Survey and Network Survey data from Qualtrics online surveys<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Grade and course data from the ND Registrar\u2019s office<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">Study participants were primarily identified and tracked through their Notre Dame NetId Email address.\u00a0 The registrar supplied us with a list of all students in the incoming 2015 cohort along with their NetId Emails. 
\u00a0 We corresponded with students through this email and used it in our internal tracking files to keep track of human subject payments and changes in phone numbers and email addresses.\u00a0\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">The study collected data not only from our participants but also, through the participants, about people in their social networks.\u00a0 This was done in two ways:<\/span><\/p>\n<ol>\n<li style=\"list-style-type: none\">\n<ol>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Through the collection of communication data from smartphones.\u00a0 When study participants communicated (through text, calls, and messages) with others, their phone records contained the phone numbers and\/or email addresses of the other party.<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Through the Network Surveys, in which participants were asked to tell us about people in their social network, we obtained names, phone numbers, and email addresses for the people in a participant\u2019s network.\u00a0\u00a0<\/span><\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<p><span style=\"font-weight: 400\">As a result, we have numerous <\/span><i><span style=\"font-weight: 400\">identifiers<\/span><\/i><span style=\"font-weight: 400\"> in various datasets that reference either a study participant or someone a study participant communicated with or listed as someone in their network:<\/span><\/p>\n<ul>\n<li style=\"list-style-type: none\">\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Notre Dame NetId Emails<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Other email addresses\u00a0<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Phone numbers<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Fitbit identification codes<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">First and last 
names\u00a0\u00a0\u00a0<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">Because we asked study participants in the Network Surveys to provide the phone numbers and email addresses of the people they named, we can link communication data involving non-study students to Network Survey data provided by the participant who communicated with that person.\u00a0 However, there are records in the communication data that involve a study participant and another individual who was not named by that participant or any other NetHealth participant; as such, we have no survey information on these individuals. For some of these records, we were able to identify the person connected to a number or email address because we had a list of all incoming (2015) first-year students with their email addresses, but for most of these records (i.e., those of non-participants about whom no data was provided in the Network Surveys), the phone number or email cannot be linked to a person. We still retain information on communications of our participants with these non-identifiable individuals.\u00a0\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">To make all this data accessible so that researchers can link records across data sets while also preserving anonymity and not releasing identifying information (names, phone numbers, email addresses), we devised and implemented a rigorous and involved system for clustering the identifiers we have in the data.\u00a0 The objective of this clustering system is to group the various identifiers (email addresses, phone numbers, Fitbit device identifiers, names) so that the identifiers in each cluster are as likely as possible to belong to one specific individual. We then assign to each identified cluster a random integer <\/span><i><span style=\"font-weight: 400\">case number <\/span><\/i><span style=\"font-weight: 400\">that is non-identifying. 
\u00a0 We then replace all identifiers (names, addresses, etc.) in the raw data with this non-identifying case number, which references a specific person. \u00a0 The same case number appears in all records in which the various identifiers were found to cluster together as the same person, effectively linking data records across data sources. Detailed information on this system can be found in the <\/span><i><span style=\"font-weight: 400\">Communication Data Cleaning<\/span><\/i><span style=\"font-weight: 400\"> and <\/span><i><span style=\"font-weight: 400\">NetHealth Network Survey Cleaning (Section &#8211; Record Linkage) <\/span><\/i><span style=\"font-weight: 400\">documents under <\/span><b>Supplemental Material.<\/b><span style=\"font-weight: 400\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">\u00a0<\/span><a href=\"https:\/\/drive.google.com\/open?id=1XSq31sybzHJkxn49eFg_UkfhbE16u1Wk\"><span style=\"font-weight: 400\">Table 2: Case Record Codes (Vertex Codes)<\/span><\/a> <span style=\"font-weight: 400\">\u00a0details the meaning of this case number.\u00a0\u00a0<\/span><\/p>\n<ul>\n<li style=\"list-style-type: none\">\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">A 4-digit number indicates a text or call to or from an entity that, as far as we can tell, is not a person.\u00a0 These records are excluded from the communication event data that we are releasing.\u00a0\u00a0<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">A 5-digit number indicates a person who was at some point officially a NetHealth Study participant.\u00a0\u00a0<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">A 6-digit number indicates a person named by a study participant in a Network Survey.<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">A 7-digit number indicates a person who was neither named in a Network Survey nor a study participant but for whom we 
have a legitimate phone number or email address.\u00a0 Some additional information is provided about these \u201cunknown\u201d persons by the first digit of the 7-digit code (see <\/span><a href=\"https:\/\/drive.google.com\/open?id=1XSq31sybzHJkxn49eFg_UkfhbE16u1Wk\"><span style=\"font-weight: 400\">Table 2: Case Record Codes<\/span><\/a><span style=\"font-weight: 400\">):\u00a0 if the first digit is a \u201c1\u201d, the person\u2019s email address was on the list of emails for all Class of 2019 students that we obtained from the Registrar in 2015, indicating that the person was an ND student.\u00a0<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">An 8-digit number indicates a questionable phone number value.\u00a0 These records are excluded from the released communication data.<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">This non-identifying case number is used to identify all records in all the various data sets and replaces all the other identifiers that were initially used to identify records.\u00a0\u00a0<\/span><\/p>\n<ul>\n<li style=\"list-style-type: none\">\n<ul>\n<li style=\"font-weight: 400\"><i><span style=\"font-weight: 400\">Survey Data<\/span><\/i><span style=\"font-weight: 400\">:\u00a0 Because this information comes only from study participants, all records are identified by a 5-digit case number.<\/span><\/li>\n<li style=\"font-weight: 400\"><i><span style=\"font-weight: 400\">Fitbit Data<\/span><\/i><span style=\"font-weight: 400\">: Because this information comes only from study participants, all records are identified by a 5-digit case number.<\/span><\/li>\n<li style=\"font-weight: 400\"><i><span style=\"font-weight: 400\">Course and Grades Data<\/span><\/i><span style=\"font-weight: 400\">:\u00a0 Because this information was obtained from the ND Registrar only for participants who consented, all case numbers have 5 digits.<\/span><\/li>\n<li style=\"font-weight: 400\"><i><span style=\"font-weight: 
400\">Network Survey Data<\/span><\/i><span style=\"font-weight: 400\">:\u00a0 This is relational data in which each record contains two case codes, one for the person (ego) who completed the survey and named people (alters) in their network, and one for each alter that the ego named.\u00a0 All ego case numbers have 5 digits. Alter case numbers have either 5 digits (if the ego named another study participant) or 6 or 7 digits (if the ego named a non-participant).<\/span><\/li>\n<li style=\"font-weight: 400\"><i><span style=\"font-weight: 400\">Network Survey Data, alter-alter list<\/span><\/i><span style=\"font-weight: 400\">:\u00a0 In the Network Survey, participants (egos) were asked, for each pair of their alters, whether the two alters knew each other.\u00a0 We release this data as an edge list in which each case number can refer to either a participant (5-digit) or a named alter (6-digit).\u00a0<\/span><\/li>\n<li style=\"font-weight: 400\"><i><span style=\"font-weight: 400\">Communication Event Data<\/span><\/i><span style=\"font-weight: 400\">:\u00a0 This is relational data where each record contains two case numbers, one for the person whose device recorded the communication event and one for the person with whom they were communicating.\u00a0 The device that recorded the event must be owned by a study participant, so all the case codes for the person from whom we received the data have 5 digits. 
The other person involved in the communication could be of any type:\u00a0 a study participant (5-digit), a named alter (6-digit), or an unknown person with a valid email address or phone number (7-digit).\u00a0<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">This case numbering system allows researchers to link data across different sources while preserving anonymity of the people about whom we have information.\u00a0 Researchers can link across data sets to merge information on study participants obtained from the basic surveys, Fitbit devices and their communication events.\u00a0 This system also lets researchers link information across data sets (and waves) on the alters whom study participants named. If the named alter is also a study participant,\u00a0 the named alter will have a 5-digit code in the Network Survey data which can be used to link their basic survey and communication event data. In addition, information on alters who are named in more than one Network Survey (either by other study participants or by the same participant at different times) is linked.\u00a0\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The NetHealth Project collected data in a variety of ways and from a variety of devices:\u00a0 Communication data from smartphones Fitbit data from Fitbit devices Basic Survey and Network Survey data from Qualtrics online surveys Grade and course data from the ND Registrar\u2019s office Study participants were primarily identified and tracked through their Notre Dame &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/sites.nd.edu\/nethealth\/data-linkage-and-identification\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Data Linkage and 
Identification&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2815,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-607","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/sites.nd.edu\/nethealth\/wp-json\/wp\/v2\/pages\/607","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.nd.edu\/nethealth\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/sites.nd.edu\/nethealth\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/sites.nd.edu\/nethealth\/wp-json\/wp\/v2\/users\/2815"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.nd.edu\/nethealth\/wp-json\/wp\/v2\/comments?post=607"}],"version-history":[{"count":5,"href":"https:\/\/sites.nd.edu\/nethealth\/wp-json\/wp\/v2\/pages\/607\/revisions"}],"predecessor-version":[{"id":735,"href":"https:\/\/sites.nd.edu\/nethealth\/wp-json\/wp\/v2\/pages\/607\/revisions\/735"}],"wp:attachment":[{"href":"https:\/\/sites.nd.edu\/nethealth\/wp-json\/wp\/v2\/media?parent=607"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}