The communication event data is derived from the logs of voice calls and various types of text messages that are kept by the operating systems of Android phones and iPhones. Over the 4+ year period we obtained over 60.5 million records of communication events. All communication events are indexed by the device from which the event details were retrieved, i.e. a study participant. Event records are ordered pairs {egoid, alterid} where the first case code number (egoid) references a study participant, while the second case code number (alterid) references the entity with whom the study participant communicated.
Given the popularity of iPhones during this period, over 95% of the events were obtained from iPhones. Android event data was retrieved through an app that participants installed on their phones, which sent the logs to our servers every night. iPhone event data was retrieved from the backups that participants periodically made of their iPhones; an app that participants installed on their computers processed each backup, extracted the communication event data, and sent it to our secure servers. Because iPhone backup files are cumulative (unless erased), initial backups from iPhones generated records with dates prior to participants' arrival at Notre Dame.
We filtered out all communications with non-humans using the case coding system described in Table 2: Case Record Codes and discussed in the Data Linkage and Identification section. From this release we deleted all events in which the alterid was either a 4-digit number (non-person numbers or codes) or an 8-digit number (questionable numbers or codes). This leaves communication with other study participants (5-digit codes), people named in one of the network surveys (6-digit codes), and other phone numbers or email addresses that we believe belong to real people (7-digit codes).
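The digit-length coding above can be checked mechanically. The following is a minimal sketch, assuming alterid values are stored as integers (or digit strings without leading zeros); the function names and category labels are illustrative and not part of the release.

```python
# Alter categories keyed by the number of digits in alterid,
# following the coding scheme described in the text above.
ALTER_KINDS = {
    4: "non-person",
    5: "study participant",
    6: "named in a network survey",
    7: "other presumed person",
    8: "questionable",
}


def digit_count(alterid):
    """Number of digits in an alterid (assumes no leading zeros)."""
    return len(str(abs(int(alterid))))


def alter_kind(alterid):
    """Return the category implied by the digit count of an alterid."""
    return ALTER_KINDS.get(digit_count(alterid), "unknown")


def is_person(alterid):
    """True for the 5-, 6-, and 7-digit codes retained in this release."""
    return digit_count(alterid) in (5, 6, 7)
```

Because 4-digit and 8-digit alterids were already removed from this release, `is_person` should hold for every record; the check is mainly useful as a sanity test after loading the data.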
Users may want to filter out additional cases. We recommend filtering out:
- Calls of zero duration (416,056 records)
- Messages of zero Length or Bytes (163,621 and 156,872 records, respectively). However, records with missing values for Duration/Length/Bytes should be kept.
- WhatsApp Group Chats (2,165,090 records) as members of a WhatsApp Group receive messages not directed at them.
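The recommended filters can be expressed as a single predicate. This is a sketch under stated assumptions: each record is a dict, the keys `duration`, `length`, `bytes`, and `eventtypedetail` are hypothetical stand-ins for the actual column names, and the label that marks a WhatsApp group chat is a placeholder (see Table 3 for the real event type codes). Missing values are represented by `None` and are kept, as recommended.

```python
def keep_event(record, group_chat_label="whatsapp_group"):
    """Return True if the record survives the recommended filters.

    The key names and the default group_chat_label are hypothetical;
    substitute the column names and codes used in the released files.
    """
    for key in ("duration", "length", "bytes"):
        # A value of zero is dropped; a missing value (None) is kept.
        if record.get(key) == 0:
            return False
    # Drop WhatsApp group chats, whose messages are not directed at the ego.
    if record.get("eventtypedetail") == group_chat_label:
        return False
    return True


events = [
    {"duration": 0, "eventtypedetail": "call"},
    {"duration": None, "length": 42, "eventtypedetail": "sms"},
    {"bytes": 1024, "eventtypedetail": "whatsapp_group"},
]
filtered = [e for e in events if keep_event(e)]
```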
Because Android phones and iPhones store data on communication events in different formats, we initially stored Android and iPhone event data in different databases and then combined them.
As detailed in Table 3: Communication Event Type Codes, events can be of one of four types: calls, SMS (Short Message Service), MMS (Multimedia Messaging Service), or WhatsApp events. Over 87% of the events are SMS. Because of issues with the call data records derived from Android phones, we have no logs of calls involving Android users. Also, only iPhones retain WhatsApp communication event records, so Android users have no WhatsApp events.
For each event record the following information is available:
- The date and time of the event, along with the Epoch time (Unix time) timestamp, the original date-time indicator from which the calendar date and time were derived. From this timestamped data we computed the following for users:
- The NetHealth study week and day on which the event occurred. Study week values range from -33 to 196 (the last week of the study). Negatively numbered weeks are weeks prior to the first week of the study (August 16 – 22, 2015). Negatively numbered days are days prior to the first day of the study (August 16, 2015).
- The day of the week on which the event occurred.
- An indicator variable for whether classes were in session on the day of the event.
- Whether the event was outgoing (i.e. sent by the study participant from whose phone the data was retrieved) or incoming.
- The type of event: calls, SMS, MMS, WhatsApp messages. The variable eventtypedetail provides more detailed information on the types of communication events. Table 3: Communication Event Type Codes lists all the event types with the corresponding event counts.
- The duration of the voice call
- The length of the message (in characters) if an SMS, or the number of bytes if an MMS.
- Confidence values for the egoid and alterid case numbers derived from the clustering algorithms used to group devices that belong to the same person (see Data Linkage and Identification). The values are our assessment of the “probability” that we matched a device identifier to the correct person. While values are presented as probabilities between 0 and 1, they are really ordinal codes indicating how certain we are, based on results from the algorithms, that an identifier in the data belongs to the cluster to which we assigned it. Egoids are all study participants and therefore have the highest value we assign, .95. Alterids with confidence values of .6 or greater are good (~80% of the records). This release also includes records with alterids about which we are less confident, but still fairly confident based on the results of the algorithm, that the identifier belongs to the assigned cluster.
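Users who want to restrict analysis to the alterids described above as good can threshold on the confidence code. The sketch below assumes each record is a dict and that the alter confidence is stored under a key named `alter_confidence`; that name is hypothetical. Because the codes are ordinal rather than true probabilities, they are compared against the threshold and not averaged or multiplied.

```python
GOOD_CONFIDENCE = 0.6  # threshold the text describes as "good"


def has_good_alter(record, threshold=GOOD_CONFIDENCE):
    """True when the alterid confidence code is at or above the threshold.

    The key "alter_confidence" is a hypothetical column name.
    """
    return record["alter_confidence"] >= threshold


events = [
    {"egoid": 10001, "alterid": 20002, "alter_confidence": 0.95},
    {"egoid": 10001, "alterid": 3000001, "alter_confidence": 0.6},
    {"egoid": 10001, "alterid": 3000002, "alter_confidence": 0.4},
]
good = [e for e in events if has_good_alter(e)]
```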
Note: In the Communication Event data, the egoid is always a NetHealth participant because egoid identifies the individual from whose device we received the event data, and we only received communication event data from NetHealth participants. The alterid can be a participant, a person named by a study participant in a network survey, or some other type of person. Users who want records indexed by sender and receiver need to use the outgoing variable to reindex them. Note that an event can involve two NetHealth participants, in which case there can be two records for the same event because it was logged on two different devices.
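The reindexing described in this note amounts to swapping egoid and alterid for incoming events. A minimal sketch, assuming each record is a dict and that the outgoing variable is truthy (for example 1) for outgoing events and falsy (for example 0) for incoming ones; that coding is an assumption and should be checked against the released files:

```python
def to_sender_receiver(record):
    """Return (sender, receiver) for one ego-indexed event record.

    Assumes record["outgoing"] is truthy when the ego sent the event.
    """
    if record["outgoing"]:
        return record["egoid"], record["alterid"]
    return record["alterid"], record["egoid"]


events = [
    {"egoid": 10001, "alterid": 20002, "outgoing": 1},
    {"egoid": 10001, "alterid": 20002, "outgoing": 0},
]
pairs = [to_sender_receiver(e) for e in events]
```

This sketch does not detect events that were logged on both participants' devices; such duplicates would need separate handling, for instance by matching on sender, receiver, and timestamp.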
Derivative Data:
This event data can be used to construct network social tie variables. To do this, the data must first be rearranged, using the outgoing variable, so that each record is an ordered pair {sender, receiver} identifying the sender and the receiver of a communication event. Researchers can then generate an arc list (for ordered pairs) by specifying a time window (e.g. daily, weekly, monthly, Fall 2015 semester). See Calendar under Data for a table containing information on when semesters begin and end, when break weeks occur, etc. Aggregating all events for the same ordered pair yields an arc list of all ties between a sender and a receiver, along with counts of the number of events the sender initiated with that receiver. Researchers can also generate daily or weekly event count time series for all arcs. Collapsing within senders across receivers yields measures of outdegree (the number of alters with whom a person initiated a communication in a given time period). Arc lists can be converted to edge lists and collapsed to yield measures of degree (the number of alters with whom a person communicated during a given time period).
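The aggregation steps above can be sketched with the Python standard library. The example assumes the events have already been reindexed as (sender, receiver) pairs and restricted to one time window; the function names are illustrative, and the small integer IDs are made-up values, not real case codes.

```python
from collections import Counter, defaultdict


def arc_counts(pairs):
    """Count events per ordered (sender, receiver) pair: the arc list."""
    return Counter(pairs)


def outdegree(arcs):
    """Number of distinct receivers each sender initiated contact with."""
    receivers = defaultdict(set)
    for sender, receiver in arcs:
        receivers[sender].add(receiver)
    return {s: len(r) for s, r in receivers.items()}


def degree(arcs):
    """Number of distinct alters per person, ignoring direction (edge list)."""
    neighbors = defaultdict(set)
    for a, b in arcs:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {n: len(v) for n, v in neighbors.items()}


# Four events within one time window, using made-up IDs.
pairs = [(1, 2), (1, 2), (2, 1), (1, 3)]
arcs = arc_counts(pairs)
```

Here `arcs[(1, 2)]` counts the events person 1 initiated with person 2, while `degree(arcs)` treats (1, 2) and (2, 1) as the same undirected tie. Repeating the computation per day or per week gives the event count time series mentioned above.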