Abstracts & Biographies

David Carey

Carey, David J.

Associate Chief Research Officer, Director, Weis Center for Research, Geisinger Clinic

Leveraging the Resources of an Integrated Health Care System for Translational Research

Geisinger Health System possesses a unique combination of resources that can drive translational research at the individual patient and population level.  The key assets include the integrated anatomy of the health system that combines all aspects of the clinical care process under a single entity; a demographically stable and supportive patient population; an advanced electronic health record infrastructure that has been in place for more than 10 years, and processes and expertise to enable facile mining of longitudinal clinical data; and the MyCode Biobank, a system-wide biorepository of blood, serum and DNA samples that are linkable to patient level electronic clinical data, and are available for broad research use.

Using Geisinger’s participation in the NIH-NHGRI-funded eMERGE (electronic Medical Records and GEnomics) Network and other projects as examples, this talk will describe the use of these assets for cutting-edge translational research.

Dr. Carey’s primary research interests are in the areas of genomic medicine, vascular biology, and extracellular matrix biology.  An important ongoing effort is the Geisinger MyCode Project, a large biobank of samples from Geisinger patients linked to data in their electronic medical records, as a platform for genomic medicine research. Dr. Carey is Co-PI of Geisinger’s eMERGE project, an NIH funded consortium of institutions with existing biorepositories linkable to electronic medical record data. Primary goals of the eMERGE network are to discover associations between human genetic variants and clinical phenotypes, and to develop means to implement genomic data in clinical care. Dr. Carey is the co-author of more than 100 peer-reviewed research papers in high impact journals, including The Journal of Cell Biology, Journal of Biological Chemistry, Journal of Neuroscience and Annual Review of Physiology.

Aran Garcia-Bellido

Garcia-Bellido, Aran

Assistant Professor, Department of Physics & Astronomy

Data Generation, Simulation, and Analysis in Particle Physics

Garcia-Bellido is an Assistant Professor of Physics at the University of Rochester, and a member of the CMS experiment at CERN's Large Hadron Collider in Geneva, Switzerland and the DZERO experiment at Fermi National Accelerator Laboratory in Batavia, IL. His research is focused on the discovery of new particles and precision studies of the heaviest known elementary particle, the top quark, as well as the measurement of the properties of the recently discovered Higgs boson that explains how particles acquire their mass. He has run a specialized data acquisition system to distribute large data volumes at high rates to a dedicated farm of more than 200 computers. He has also developed complex algorithms for data analysis relying heavily on distributed computing.

Eric Gawiser

Gawiser, Eric

Associate Professor, Department of Physics & Astronomy at Rutgers, the State University of New Jersey

Billions and Billions of Stars: Big Data in Astronomy

Due to technological advances and a seemingly limitless universe to observe, Astronomy has been one of the leading sciences in obtaining, analyzing, and publicly archiving huge datasets. Current projects are starting to produce "Big Data", and the next generation of observatories, notably the Large Synoptic Survey Telescope (LSST) and Square Kilometer Array (SKA), will complete the transition. A Virtual Astronomical Observatory (VAO) is under development to handle the archiving of the images and catalogs generated by these projects and to catalyze efficient scientific analysis by the astronomical community. As the leading example, LSST will produce tens of petabytes of images over 10 years of operation. The LSST image archive will be processed through sophisticated algorithms to yield a catalog of billions of stars and billions of galaxies, including tens of trillions of database records describing individual observations of those objects.

Eric Gawiser is an Associate Professor in the Department of Physics & Astronomy at Rutgers, the State University of New Jersey. Gawiser studies galaxies, stars and black holes to understand how these objects form and to probe fundamental physics. He received a bachelor's degree in Physics and Public Policy from Princeton University in 1994 and earned his Ph.D. in Physics from U.C. Berkeley in 1999, specializing in theoretical cosmology. He began using the world's largest telescopes to study distant galaxies as a postdoctoral fellow at U.C. San Diego and later a National Science Foundation Astronomy & Astrophysics Postdoctoral Fellow at Yale University. Since joining the Rutgers faculty in 2007, Gawiser has obtained grant support for his research efforts from NASA, the Department of Energy, and the National Science Foundation, including an NSF CAREER Award. He currently serves as the Principal Investigator of the MUSYC collaboration, Co-Chair of the Large Synoptic Survey Telescope (LSST) science collaboration on Large-Scale Structure, and Chair of the National Optical Astronomy Observatory (NOAO) Users Committee. His more than 200 scientific publications have received over 5000 citations, and he has given more than 100 invited talks at scientific conferences and universities. Gawiser is also an accomplished teacher and public speaker, including an appointment as an Associate of the Hayden Planetarium, an Outstanding Teacher Award from the Rutgers Society of Physics Students, and recognition for Distinguished Contributions to Undergraduate Education from the Rutgers School of Arts and Sciences.

M. Ehsan Hoque

Hoque, M. Ehsan

Assistant Professor, Computer Science

Computers to Help with Conversations: Affective Framework to Enhance Human Nonverbal Skills

Nonverbal behavior plays an integral part in a majority of social interaction scenarios. Being able to adjust nonverbal behavior and influence others’ responses are considered valuable social skills.

A deficiency in nonverbal behavior can have detrimental consequences in personal as well as in professional life. Many people desire help, but due to limited resources, logistics, and social stigma, they are unable to get the training that they require. Therefore, there is a need for developing techniques and interventions to enhance human nonverbal behaviors that are standardized, objective, repeatable, low-cost, and can be deployed outside of the clinic. In this talk, I will describe the design process and validation techniques of a computational framework to enhance human nonverbal behavior. As part of the framework, I developed My Automated Conversation coacH (MACH)—a novel system that provides ubiquitous access to social skills training. The system includes a virtual agent that reads facial expressions, speech, and prosody, and responds with verbal and nonverbal behaviors in real-time. I will present results on how we went about validating the framework with 90 undergraduate students.

I will wrap up the talk by highlighting the grand challenge of understanding and recognizing nonverbal behavior. Automated recognition of nonverbal data requires modeling of complex, multidimensional data with subtle, uncertain and overlapping labels. The ability to collect nonverbal data via the cloud is opening new possibilities, but it is also introducing new challenges of modeling a massive amount of messy real-world data. How would we go about solving those challenges that did not exist before?

M. Ehsan Hoque is an Assistant Professor and the director of the Human-Computer Interaction Lab in the Computer Science department of the University of Rochester, USA. He received his PhD from the Media Lab of the Massachusetts Institute of Technology in August 2013. Ehsan’s work on nonverbal behavior understanding and recognition has received a Best Paper Award at UbiComp 2013 and Best Paper Nominations at the Face and Gesture (FG) 2011 and Intelligent Virtual Agent (IVA) 2006 conferences, and has appeared in the popular press, including Time Magazine, Wall Street Journal, MIT Technology Review, NPR, and PBS, among many others. Some of his research prototypes (e.g., Disney animatronics, MIT Mood Meter) have been deployed in Disney Parks and at several public places at MIT, allowing open interaction with thousands of people and data collection for an extended period.

Ehsan has industrial R&D experience at Goldman Sachs, Walt Disney Imagineering and IBM T. J. Watson Research Center, and was the recipient of the IEEE Gold Humanitarian Fellowship in 2009 for his work on autism intervention.

Saurabh Kataria

Kataria, Saurabh

Research Scientist, Xerox

Big Data and Customer Care: A case study of early intervention using Social Media

As more and more people turn to social media to discuss with friends their issues, concerns, and experiences with the technical services they consume, service and product companies have an opportunity to use that data to make improvements and target the right customer base. From a business perspective, customer care organizations have started to proactively engage with customers on social media to facilitate early intervention in resolving issues. Doing so requires exploiting the dynamics of social media to help customer care agents detect, track, and resolve customers' issues. In this talk, I will share some of the current research efforts in social media mining for customer care led by Xerox Research.

Saurabh Kataria received his Ph.D. in Information Science and Technology from Pennsylvania State University. His main area of focus is statistical machine learning methods applied to social media mining and link discovery in interconnected information sources. Saurabh joined the Xerox research center at Webster in 2012 and has been involved in projects related to social media and Big Data.

John Kessler

Kessler, John

Associate Professor, Earth and Environmental Sciences Department, University of Rochester

Big Data in the Ocean Sciences: from ultra-fast instrumentation to global data integration

Covering roughly three-quarters of the surface of our planet, the oceans exert massive influences on global economics, climate, weather, energy, transportation, food, and culture, to name but a few. Our understanding of the dynamics of this reservoir involves a global network of measurements feeding regional and global models. Big data plays a large role in ocean sciences at a variety of scales ranging, for example, from systems of buoys to satellites as well as microbial genetics. Here I will discuss the efforts of my laboratory to develop instruments that collect data at megahertz rates in order to probe the in-situ chemical conditions of ocean waters. The ocean is in a constant state of flux and I will also discuss efforts to coordinate, disseminate, and model data in near real time to aid scientists, policy makers, educators, and those responding to changing oceanographic conditions be those changes induced by natural or industrial processes.

John Kessler is an Associate Professor in the Earth and Environmental Sciences Department at the University of Rochester. He is a chemical oceanographer who focuses on how oceanic dynamics influence atmospheric greenhouse gas budgets. Specifically, he investigates methane and carbon dioxide chemistry in oceans and how they interact with the climate system. Kessler earned his Ph.D. from the University of California Irvine, conducted his postdoctoral research at Princeton University, and is a recent recipient of a Sloan Research Fellowship in Ocean Sciences.

John Langford

Langford, John

Principal Researcher, Microsoft Research

Learning from Lots of Data

In the last decade, sources of data for machine learning have become extremely plentiful, leading to new classes of problems. I will discuss two such problems, whose joint solution is critical to successful learning here:

(1) How do you create learning algorithms capable of coherently dealing with large quantities of data?
(2) How do you learn given the partial-feedback nature of most large-scale data sources?

In addition, I'll discuss some open problems which I'd like to solve in these areas.

John Langford studied Physics and Computer Science at the California Institute of Technology, earning a double bachelor's degree in 1997, and received his Ph.D. from Carnegie Mellon University in 2002. Since then, he has worked at Yahoo!, Toyota Technological Institute, and IBM's Watson Research Center. He is also the primary author of the popular Machine Learning weblog hunch.net and the principal developer of Vowpal Wabbit. Previous research projects include Isomap, Captcha, Learning Reductions, Cover Trees, and Contextual Bandit learning. For more information visit http://hunch.net/~jl.

Edith Law

Law, Edith

CRCS Postdoctoral Fellow, School of Engineering and Applied Sciences at Harvard University

Re-thinking Citizen Science from the Scientist's Perspective

Science is increasingly data-intensive; many scientific questions require data to be collected over large geographic regions and many time scales, and many of the tasks for reducing raw data to conclusions are not yet automated by computers. The idea of citizen science is to engage massive numbers of everyday citizens in the collection and interpretation of data in order to answer scientific questions at scale. To date, the barrier to entry has been substantial: researchers who are interested in leveraging crowdsourcing in their research process must either build their own website and user base from scratch, partner with existing citizen science organizations (e.g., Zooniverse), or leverage paid crowdsourcing platforms such as Mechanical Turk, which require a fair degree of technical sophistication to use effectively.

We introduce Curio (http://crowdcurio.com), a platform for crowdsourcing research tasks in the sciences and humanities. Curio is designed to allow researchers, who are domain experts but not necessarily technically savvy or familiar with crowdsourcing, to create and launch a new crowdsourcing project with minimal effort, and to draw on the curiosity of a mixed-expertise crowd to generate data towards testing specific hypotheses. In this talk, I will discuss findings from a set of 18 interviews conducted with researchers in the natural, social and medical sciences, and how these findings inspire the way we design Curio and re-think crowdsourcing in the scientific domains.

Edith Law is a CRCS postdoctoral fellow at the School of Engineering and Applied Sciences at Harvard University. She graduated from Carnegie Mellon University in 2012 with a Ph.D. in Machine Learning, where she studied human computation systems that harness the joint efforts of machines and humans. She is a Microsoft Graduate Research Fellow, co-authored the book "Human Computation" in the Morgan & Claypool Synthesis Lectures on Artificial Intelligence and Machine Learning, co-organized the Human Computation Workshop (HCOMP) Series at KDD and AAAI from 2009 to 2012, and helped create the first AAAI Conference on Human Computation and Crowdsourcing. Her work on games with a purpose and large-scale collaborative planning has received best paper honorable mentions at CHI.

Amit Mukherjee

Mukherjee, Amit

Computational Post Doctoral Fellow, Cold Spring Harbor Lab NY

Whole-Brain Circuit Mapping in Mouse

The Mouse Brain Architecture Project aims at uncovering the neuronal connectivity of the mouse brain at a mesoscopic spatial scale. The project employs a grid-based approach to map neuronal projections between regions of the mouse brain using four tracers (two retrograde, two anterograde). Each individual mouse brain is injected at one of the selected grid points using one of four neuronal tracers. The brains are subsequently cryo-sectioned and processed through a high-throughput histology and imaging pipeline using slide-scanning optical microscopy, enabling the neuroanatomist to observe and resolve cellular/process level complexity across the whole brain. The experiments have created petabyte-scale image data that is automatically processed for open access on the web, providing a virtual microscope into these brains. The data are subject to computational neuroanatomical methods to provide a comprehensive picture of neuronal projection patterns across the mouse brain. The talk will detail ongoing work with emphasis on the associated big-data challenges.

Amit Mukherjee is a computational postdoctoral fellow at the Cold Spring Harbor Lab, NY. He completed his Masters in Electrical Engineering at the University of Houston, Houston, Texas in 2004 and his PhD in Electrical Engineering at Rensselaer Polytechnic Institute, Troy, NY in 2011. His research interests primarily include signal and image processing, computer vision, pattern recognition, and neural networks.

Alex Paciorkowski

Paciorkowski, Alex

Senior Instructor in Neurology, Biomedical Genetics and Pediatrics, University of Rochester, School of Medicine and Dentistry

Autism Genetics and the Challenges of Curating Brain Development

Understanding the pathogenesis of autism is directly linked to elucidating the genetics of brain development. Using examples from FOXG1-related disorders, a genome-wide search for genes causing agenesis of the corpus callosum, and data from whole exome sequencing, Dr. Paciorkowski will discuss the challenges involved from a biological and informatics perspective.

Dr. Alex Paciorkowski is Assistant Professor of Neurology, Pediatrics, and Biomedical Genetics at the University of Rochester Medical Center. He trained in pediatrics and medical genetics at the University of Connecticut, and Child Neurology at Washington University in St. Louis. Following post-doctoral training in the lab of Dr. William Dobyns in Seattle, he was recruited to URMC in 2012. The Paciorkowski Lab is focused on understanding the genetic causes of developmental disorders of the brain, including autism, childhood epilepsy, brain malformations, and movement disorders. Projects in the lab use a number of tools, including molecular genetics, developing in vitro models, and the analysis of next-generation sequencing data. The lab also creates new bioinformatics tools to integrate the neurogenetics knowledge base.


Ruslan Salakhutdinov

Salakhutdinov, Ruslan

Assistant Professor, Department of Computer Science and Department of Statistics, University of Toronto

Recent Advances in Deep Learning: Learning Structured, Robust, and Multimodal Models

Building intelligent systems that are capable of extracting meaningful representations from high-dimensional data lies at the core of solving many Artificial Intelligence tasks, including visual object recognition, information retrieval, speech perception, and language understanding.

In this talk I will first introduce a broad class of hierarchical probabilistic models called Deep Boltzmann Machines (DBMs) and show that DBMs can learn useful hierarchical representations from high-dimensional data with applications in information retrieval, object recognition, speech perception, and collaborative filtering. I will then describe a new class of more complex models that combine Deep Boltzmann Machines with structured hierarchical Bayesian models and show how these models can learn a deep hierarchical structure for sharing knowledge across hundreds of visual categories, which allows accurate learning of novel visual concepts from few examples. Finally, I will introduce deep models that are capable of extracting a unified representation that fuses together multiple data modalities. I will show that on several tasks, including modeling images and text, video and sound, these models significantly improve upon many of the existing techniques.

Ruslan Salakhutdinov received his PhD in machine learning (computer science) from the University of Toronto in 2009. After spending two post-doctoral years at the Massachusetts Institute of Technology Artificial Intelligence Lab, he joined the University of Toronto as an Assistant Professor in the Department of Computer Science and Department of Statistics.

Dr. Salakhutdinov's primary interests lie in statistical machine learning, Bayesian statistics, probabilistic graphical models, and large-scale optimization. He is the recipient of the Early Researcher Award, Connaught New Researcher Award, Alfred P. Sloan Research Fellowship, and Microsoft Research Faculty Fellowship, and is a Fellow of the Canadian Institute for Advanced Research.

Marc Schieber

Schieber, Marc H.

Professor of Neurology, of Neurobiology, and of Biomedical Engineering, University of Rochester
Attending Neurologist, Unity Health, Rochester, NY

The BigData Needs of Neuro-Prosthetics

Recent years have seen the emergence of an effort to develop brain computer interfaces (BCIs). Such systems enable a waking subject to control an external device voluntarily via signals recorded directly from the subject’s own nervous system. Multiple scales of electrophysiological signals can be used: magneto-encephalographic (MEG) recorded trans-cranially, electroencephalographic (EEG) recorded from the scalp, electrocorticographic (ECoG) recorded from the surface of the cerebral cortex, local field potentials (LFP) recorded from penetrating electrodes, multi-unit spiking activity (MUA) recorded from several neighboring neurons, and single unit activity (SU) recorded from individual neurons. Optogenetics offer another potential means of recording from selective populations of neurons. In addition, signals can be recorded from peripheral nerves or muscles.

Whatever the source, signals from multiple channels are transformed in real time by an algorithm to enable the subject to control an external device, which may range from a cursor on a computer screen to a state-of-the-art mechanical arm and hand. In general, the greater the number of separate data channels that can be collected and processed, the better the control achieved by the subject. A tradeoff exists, however, between the sampling frequency needed to record signals accurately, and the information content of the signal: SUs typically are the most informative, but must be sampled at 30-40 kHz and “sorted”; LFPs are intermediate and typically are sampled at 500 Hz to provide un-aliased data up to 250 Hz; passive filtering by the scalp limits both the frequency content of EEG to < 50 Hz and the information content.
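The link between the 500 Hz sampling rate and the 250 Hz limit quoted above is the Nyquist criterion: a signal sampled at rate fs is free of aliasing only up to fs/2. The short Python sketch below illustrates this relationship; the function names and the 200 Hz and 300 Hz test tones are illustrative assumptions, not part of the talk.

```python
def nyquist_frequency(sampling_rate_hz: float) -> float:
    """Highest frequency representable without aliasing at this sampling rate."""
    return sampling_rate_hz / 2.0


def apparent_frequency(signal_hz: float, sampling_rate_hz: float) -> float:
    """Frequency at which a pure tone appears after sampling (folded into 0..fs/2)."""
    folded = signal_hz % sampling_rate_hz
    return min(folded, sampling_rate_hz - folded)


lfp_rate_hz = 500.0  # LFP sampling rate quoted in the abstract

# Limit of un-aliased content for LFP recordings.
print(nyquist_frequency(lfp_rate_hz))

# A hypothetical 200 Hz component is below the limit and is preserved,
# while a hypothetical 300 Hz component folds back onto a lower frequency.
print(apparent_frequency(200.0, lfp_rate_hz))
print(apparent_frequency(300.0, lfp_rate_hz))
```

Content above the Nyquist limit is not simply lost; it is misread as a lower frequency, which is why sampling rates are chosen with the signal's bandwidth in mind.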

As of 2013, the state of the art collects SU and MUA from ~200 channels simultaneously to enable a human subject to control 7 degrees of freedom (DoF) in a mechatronic arm/hand that has > 20 DoF. This entails collecting and processing data at a rate of ~1 MB/s. Our capacity to handle BigData in real-time will need to increase if the number of controlled DoFs is to increase. Moreover, this capacity will need to be achieved in miniaturized, low-power hardware if subjects are to be able to “wear” their devices (or have the devices implanted!) for mobile, everyday activities.

Marc H. Schieber received his A.B. in 1974, and M.D./Ph.D. in 1982, from Washington University in St. Louis. He currently is Professor of Neurology, of Neurobiology, and of Biomedical Engineering at the University of Rochester, and Attending Neurologist at Unity Health, Rochester, NY. His research focuses on how the nervous system controls muscles to perform dexterous finger movements. Dr. Schieber is a member of the Society for Neural Control of Movement and the Society for Neuroscience. He has received an NINDS Javits Investigator Merit Award and has served as Chair of the NIH Sensorimotor Integration Study Section.

Vikas Sindhwani

Sindhwani, Vikas

Research Staff Member, Machine Learning group at IBM T.J. Watson Research Center

Finding Non-linear Structure in Big Data: Randomized Algorithms for Large-scale Kernel Methods

The promise of Big Data resides in being able to unlock very complex dependencies in a given domain. As data sizes increase, it is critical to relax strong parametric assumptions on the nature of the underlying dependency, and let the data "speak for itself". Kernel methods offer a rigorous non-parametric framework to extend a broad class of linear statistical algorithms to nonlinear modeling settings. Their scalability, however, remains a significant challenge. I will present recent work on a family of randomized and quasi-randomized algorithms for approximating kernel functions using low-dimensional explicit feature maps that admit highly scalable (and parallelizable) solutions. Using these methods, we obtain near state-of-the-art performance on large-scale speech recognition and computer vision tasks.
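As a rough illustration of approximating a kernel with an explicit low-dimensional feature map, the sketch below uses random Fourier features for the Gaussian kernel, a well-known randomized construction. It is not necessarily the family of algorithms presented in the talk, and the kernel width `gamma`, the number of features, and the sample points are all assumptions chosen for the example.

```python
import math
import random


def gaussian_kernel(x, y, gamma):
    """Exact Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def make_feature_map(input_dim, num_features, gamma, seed=0):
    """Return a function mapping a point to `num_features` random cosine features."""
    rng = random.Random(seed)
    # Frequencies are drawn from the Fourier transform of the Gaussian kernel:
    # a normal distribution with standard deviation sqrt(2 * gamma).
    sigma = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, sigma) for _ in range(input_dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical inputs: two 3-dimensional points and an assumed kernel width.
x = [0.2, -0.1, 0.4]
y = [0.0, 0.3, 0.1]
gamma = 0.5

feature_map = make_feature_map(input_dim=3, num_features=4000, gamma=gamma)
exact = gaussian_kernel(x, y, gamma)
approx = dot(feature_map(x), feature_map(y))
print(exact, approx)
```

Once points are mapped through `feature_map`, any linear algorithm applied to the mapped features behaves approximately like its kernelized counterpart, and the mapping of each point can be computed independently, which is what makes this style of approximation easy to parallelize.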

Vikas Sindhwani is a research staff member in the Machine Learning group at IBM T.J. Watson Research Center. His research interests include design and implementation of learning algorithms and numerical optimization techniques on Big Data; non-parametric modeling with kernel methods; and applications in various domains including social media analytics and recommender systems. He has over 50 publications in these areas including a best paper award at UAI 2013. He received his doctoral degree in Computer Science from the University of Chicago in 2007, and a Bachelor's degree in Engineering Physics from the Indian Institute of Technology, Mumbai, India in 2001.

Ellen Voorhees

Voorhees, Ellen

Senior Computer Scientist, U.S. National Institute of Standards and Technology (NIST)

Metrology for Big Data

Lord Kelvin is reported to have observed "If you can not measure it, you can not improve it." For more than 20 years, the Text REtrieval Conference (TREC) project at the National Institute of Standards and Technology has created standard test sets and evaluation methodology to support the development of methods for content-based access to material structured for human, rather than machine, consumption. Starting with (massive-for-the-time) two gigabytes of newswire text in 1992 and progressing to web-scale data collections, TREC has examined a variety of tasks including question answering, retrieving digital video, web search, legal discovery, secondary use of electronic health records, and sentiment analysis in blogs and tweets.

TREC’s “coopetition” paradigm emphasizes individual experiments evaluated on a benchmark task that leverages a relatively modest investment in infrastructure into a significantly greater amount of research and development. This has had three major impacts: improved effectiveness of information access algorithms; cross-fertilization of ideas across research groups with the eventual transfer of technology into products; and the formation of new research areas enabled by the construction of critical infrastructure.

Ellen Voorhees is a Senior Computer Scientist in the Information Technology Laboratory at the U.S. National Institute of Standards and Technology (NIST) where she leads the Text REtrieval Conference (TREC) project. Her research focuses on developing and validating appropriate evaluation schemes to measure system effectiveness for information access tasks, especially when those tasks involve data not specifically structured for machine use. She has a Ph.D. in Computer Science from Cornell University, and was previously a member of technical staff at Siemens Corporate Research where her work on intelligent agents for information access was awarded three U.S. patents.

Fei Wang

Wang, Fei

Research Staff Member, Healthcare Analytics Research group, IBM T. J. Watson Research Center

Feature Engineering for Predictive Modeling with Large Scale Electronic Medical Records: Augmentation, Densification and Selection

Fei Wang’s major research interests include machine learning, data mining, social informatics and healthcare informatics. He has published over 100 papers in the top venues of the relevant fields.

Predictive modeling lies at the heart of many medical informatics problems, such as early detection of some chronic diseases and patient hospitalization/readmission prediction. The data those predictive models are built upon are Electronic Medical Records (EMRs), which are systematic collections of patient information including demographics, diagnoses, medications, lab tests, etc. We refer to this information as patient features. High-quality features are of vital importance to building successful predictive models. However, the features extracted directly from EMRs are typically noisy, heterogeneous and very sparse. In this talk, I will present a feature engineering pipeline for constructing effective features from EMRs, which includes three steps: (1) feature augmentation, which constructs more effective derived features based on existing features; (2) feature densification, which imputes the missing feature values; and (3) feature selection, which identifies the most representative and predictive features. I will also show empirical results on predicting the onset of Congestive Heart Failure in real-world patients to demonstrate the advantages of the proposed pipeline.
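The three steps named in the abstract can be sketched as a tiny Python pipeline. Everything here is an illustrative stand-in rather than the method from the talk: the derived ratio feature, mean imputation, and correlation-based ranking are simple assumed choices, and the four patient rows and labels are made-up toy values.

```python
import math


def augment(rows):
    """Feature augmentation: append a derived feature (ratio of the first two columns)."""
    out = []
    for row in rows:
        a, b = row[0], row[1]
        ratio = a / b if a is not None and b not in (None, 0) else None
        out.append(row + [ratio])
    return out


def densify(rows):
    """Feature densification: replace missing values with the column mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select(rows, labels, k):
    """Feature selection: keep the k columns most correlated with the label."""
    n_cols = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[r[j] for j in keep] for r in rows]


# Toy, made-up records: three raw features per patient, None marks a missing value.
raw = [
    [1.0, 2.0, 5.0],
    [2.0, None, 5.0],
    [3.0, 2.0, 5.0],
    [4.0, 4.0, 5.0],
]
labels = [0, 0, 1, 1]  # hypothetical outcome labels

dense = densify(augment(raw))
kept_columns, selected = select(dense, labels, k=2)
print(kept_columns)
print(selected)
```

Running augmentation before densification means derived features inherit missingness from their inputs and are imputed along with everything else; a real pipeline would need to decide that ordering deliberately.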

Bin Zhang

Zhang, Bin

Assistant Professor, Department of Management Information Systems, Temple University

Comparing Peer Influences in Large Social Networks

The prevalence of social network analysis has made technology diffusion an important topic in the Information Systems literature. Recent research suggests that adoption by individuals can be predicted not only from their personal tastes and characteristics, but also from the preferences of people who are close to them in their networks. Most models addressing these issues only consider one operative network. In reality there are often several networks influencing a targeted population, for example, friendship and colleagueship. However, researchers face two challenges before they can compare multiple network influences on individuals' behaviors and attitudes in large social network contexts: high heterogeneity across subsets of the network, and the computing cost of processing the whole network. I developed a novel technique that can efficiently extract high-quality subgraphs from large-scale networks so that the researcher can analyze these subgraphs as stand-alone subpopulations, instead of analyzing the whole population. In addition, I developed a hierarchical, multiple network-regime autocorrelation model for this class of problem and propose two algorithms for fitting it, one based on an Expectation-Maximization (E-M) approach and the other on a Bayesian model using Markov Chain Monte Carlo (MCMC). I also illustrate how this approach can be applied to real social networks to compare the explanatory power of cohesion versus structural equivalence for social influence.

Bin Zhang’s primary research interests are social network analysis and business analytics. He is currently designing new algorithms and statistical models to analyze large-scale social networks, and studying their applications in technology diffusion and online social media. Bin has played key roles in research projects for NASA, NSF, and the Louisiana Department of Transportation and Development.