Track: Papers in Production: Modern CS in the Real World

Location: Cyril Magnin III

Day of week: Tuesday

Which papers are making a real-world impact today? This track looks at important papers that are influencing and changing software now, covering speech, infrastructure, self-driving cars, GANs, probabilistic data structures, and other areas of deep learning. The Papers in Production track aims to show research that is being used in production.

Track Host: Sid Anand

Chief Data Engineer @PayPal

Sid Anand currently serves as PayPal's Chief Data Engineer, focusing on ways to realize the value of data. Prior to joining PayPal, he held several positions, including Data Architect at Agari, Technical Lead in Search at LinkedIn, Cloud Data Architect at Netflix, VP of Engineering at Etsy, and several technical roles at eBay. Sid earned his BS and MS degrees in CS from Cornell University, where he focused on Distributed Systems. In his spare time, he is a maintainer/committer on Apache Airflow, a co-chair for QCon, and a frequent speaker at conferences. When not working, Sid spends time with his wife, Shalini, and their two kids.

10:40am - 11:20am

Petastorm: A Light-Weight Approach to Building ML Pipelines

Data produced and managed by Big Data systems like Apache Spark and Hive cannot be directly consumed by Deep Learning systems like TensorFlow and PyTorch. Petastorm bridges this gap by letting TensorFlow and PyTorch directly consume data stored in Apache Parquet format. In this talk, we describe how Petastorm facilitates tighter integration between the Big Data and Deep Learning worlds, simplifies data management and data pipelines, and speeds up model experimentation.

Yevgeni Litvin, Tech Lead @Uber

11:40am - 12:20pm

Scaling Emerging AI Applications with Ray

The next generation of AI applications will continuously interact with the environment and learn from these interactions. To develop these applications, data scientists and engineers will need to seamlessly scale their work from interactive sessions to production clusters. In this talk, I’ll cover some major open source AI + Data Science libraries my collaborators and I at the RISELab have been working on.

At a high level, I’ll talk about my work on the following: Ray, a distributed execution framework for emerging AI applications; Tune, a scalable hyperparameter optimization framework for reinforcement learning and deep learning; RLlib, an open-source library for reinforcement learning that offers both a collection of reference algorithms and scalable primitives for composing new ones; and Modin, an open-source dataframe library for scaling pandas workflows by changing one line of code.

Peter Schafhalter, Research Assistant @ucbrise

1:20pm - 2:00pm

Scaling Deep Learning

NERSC has successfully applied Deep Learning to a range of scientific workloads. Motivated by the volume and complexity of scientific datasets, and the computationally demanding nature of DL, we have undertaken several projects targeted at scaling DL on the largest CPU- and GPU-based systems in the world. This talk will explore 2D and 3D convolutional architectures for solving pattern classification, regression, and segmentation problems in high-energy physics, cosmology, and climate science. Our efforts have resulted in a number of first-time results: scaling Caffe to 9600 Cori/KNL nodes, obtaining 15PF performance (SC’17); scaling TensorFlow to 8192 Cori/KNL nodes, obtaining 3.5PF performance (SC’18); and finally, scaling TensorFlow to 4560 Summit/Volta nodes, obtaining 1EF performance (SC’18). The talk will review lessons learnt from these projects and outline future challenges for the DL community.

Prabhat, Data and Analytics Group Lead @NERSC

2:20pm - 3:00pm

FAIR: Advances in Speech at Facebook

Presentation details will follow soon.

Vitaliy Liptchinsky, Research Engineering Manager @Facebook

3:20pm - 4:00pm

Modern CS in the Real World Panel Discussion

Panel details will follow soon.

4:20pm - 5:00pm

Building Data Products for Social Good

Facebook partners with humanitarian and academic organizations, as well as community-driven projects like OpenStreetMap, on a number of Data for Good efforts. Examples of the outputs include:

  • the High-Resolution Settlement Layer, for which we identify the locations of human-built structures from high-resolution satellite images and, in collaboration with Columbia University, add population data,
  • our Disaster Maps, which contain aggregated, anonymized information about the availability of network coverage and power, as well as human mobility in the context of natural disasters,
  • our large-scale input into OpenStreetMap, for which we detect roads from high-resolution satellite images, prepare them for human review, and feed the results into OpenStreetMap.

We will present details about the methods, challenges, and community feedback involved in producing these datasets, as well as the impact they've each had over the last two years.

Andreas Gros, Data Scientist @Facebook
Shankar Iyer, Data Scientist @Facebook

2019 Tracks

  • Grokking Timeseries & Sequential Data

    Techniques, practices, and approaches around time series and sequential data. Expect topics including image recognition, NLP/NLU, preprocessing, and related algorithms.

  • Deep Learning in Practice

    Deep learning use cases around edge computing, deep learning for search, explainability, fairness, and perception.