Big Data Could Provide Early Warning System on Disease Outbreaks

Henry Kautz

Big Data research at the University of Rochester could eventually help authorities identify global disease outbreaks in their earliest stages and track their spread.

It is the next step in a project that made international headlines. Henry Kautz, chair of computer science, and Adam Sadilek, now a postdoctoral fellow, demonstrated that they could predict which Twitter users would get the flu—up to eight days in advance—by “mining” the social media network for tweets of people reporting symptoms in the New York City area.

They used the GPS tags embedded in tweets sent from cell phones to track those users’ encounters with other Twitter users, whose own risk of becoming ill could then be calculated and tested.

The approach has the potential to dwarf previous methods for health monitoring in scalability and immediacy. The project has expanded to include social network data gathered from cities and airports around the world. “Today, it can take months to figure out where in the world a disease outbreak originated—and meanwhile people from that area will be carrying the disease all around the world,” Kautz says. But by applying large-scale machine learning methods to a social network like Twitter, “in a matter of days, we could say there’s a disease outbreak in Los Angeles and it looks like the point of origin could be Buenos Aires.”

Kautz is confident this approach “could give researchers, medical professionals, and organizations like the Centers for Disease Control a sort of early warning system that could be applicable to all kinds of disease outbreaks.” In addition to improving immediate response to disease outbreaks, the data can also be mined to help answer fundamental questions, such as how large-scale epidemics emerge from low-level interactions between people in the course of their everyday lives.

Most previous work in computational epidemiology focuses on “simulated populations and hypothetical scenarios,” Kautz and Sadilek note. Instead, for their flu study, they used a Twitter search application to collect 16 million “real time” tweets from 630,000 different users in the New York City area during a single month. They zeroed in on the tweets of 6,237 individuals who posted more than 100 GPS-tagged tweets during the study period.
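The filtering step described above can be illustrated with a short Python sketch. It is not the researchers' code: the record layout (a dict with hypothetical `user`, `lat`, and `lon` keys) and the function name `active_gps_users` are assumptions made for the example. Only the threshold of more than 100 GPS-tagged tweets comes from the study description.

```python
from collections import Counter


def active_gps_users(tweets, min_tagged=100):
    """Return users who posted more than `min_tagged` GPS-tagged tweets.

    `tweets` is an iterable of dicts with hypothetical keys "user", "lat"
    and "lon"; a tweet counts as GPS-tagged when both coordinates are set.
    """
    counts = Counter(
        t["user"]
        for t in tweets
        if t.get("lat") is not None and t.get("lon") is not None
    )
    # "More than 100" in the article, so the comparison is strict.
    return {user for user, n in counts.items() if n > min_tagged}
```

Applied to the full collection, a filter of this kind would narrow the 630,000 users down to the small group whose movements can be followed in detail.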

The researchers developed statistical natural-language processing algorithms that identified 2,047 tweets reporting flu-like symptoms. Locations were mapped, other Twitter users who visited the same locations were identified, and probabilistic models were then constructed to predict if and when an individual would fall ill.
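To make the idea of co-location concrete, here is a minimal Python sketch of how encounters with symptomatic users might be counted and turned into a risk score. Everything in it is an assumption for illustration: the grid size, the one-hour window, the function names, and the logistic coefficients are invented, and the simple logistic score merely stands in for the probabilistic models the researchers built, which the article does not specify.

```python
import math
from collections import defaultdict


def grid_cell(lat, lon, cell_deg=0.001):
    """Snap a coordinate to a coarse grid cell (about 100 m of latitude)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def count_encounters(visits, sick_users, window_hours=1.0):
    """Count, for each healthy user, visits that overlap a sick user's visit.

    `visits` holds (user, hour, lat, lon) tuples. An encounter is a pair of
    visits in the same grid cell no more than `window_hours` apart.
    """
    by_cell = defaultdict(list)
    for user, hour, lat, lon in visits:
        by_cell[grid_cell(lat, lon)].append((user, hour))

    encounters = defaultdict(int)
    for entries in by_cell.values():
        sick_hours = [h for u, h in entries if u in sick_users]
        for user, hour in entries:
            if user in sick_users:
                continue
            encounters[user] += sum(
                1 for h in sick_hours if abs(h - hour) <= window_hours
            )
    return dict(encounters)


def illness_risk(n_encounters, bias=-3.0, weight=0.8):
    """Toy logistic score; the coefficients are made up, not fitted to data."""
    return 1.0 / (1.0 + math.exp(-(bias + weight * n_encounters)))
```

In this sketch, a user who tweeted from the same block as a symptomatic user within the hour gets a higher score than one who passed through hours later or tweeted from elsewhere.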

Kautz is looking at other areas where the application of Big Data methods to Twitter could bear fruit. Would it be possible, for example, to estimate people’s emotional states from their tweets? When depressed people are tweeting with other people who either are or are not depressed, what is the effect? Is depression contagious?

Big Data mining of social networks is sometimes equated with Big Brother–like invasions of privacy. But there’s an important distinction to be drawn, Kautz says.

“Dozens of companies are data mining your social media in order to try to sell you things,” Kautz says. “This may or may not be a good thing. By contrast, our goal of improving national and global health is clearly a benefit to society.”