# Biography

I am a Ph.D. student at the Université de Montréal (UdeM) coadvised by Yoshua Bengio and Hugo Larochelle. My primary area of interest is the use of generative models for model-based reinforcement learning and reasoning algorithms. Before coming to UdeM, I spent a fantastic year at Google Research (in a seemingly never-ending internship) where I worked with Google Accelerated Sciences and Google Brain. Like a growing number of researchers in machine learning, I have a background in physics. I completed my M.S. in Physics at UC San Diego, where I was coadvised by David Meyer and Gary Cottrell. Before that, I graduated from MIT in physics and researched techniques for directional dark matter detection with the DMTPC collaboration.

2017-Now
Université de Montréal
Ph.D. in computer science
Winter and Spring 2017
Summer and Fall 2016
Semantic segmentation in microscopy images
Summer 2015
Zillow Group Internship
Deep CNNs and LSTMs for image quality detection
2013-2017
UC San Diego: Master's Degree
CMS Group at CERN/Machine learning
2010-2013
Fidelity Management and Research Company
Equity research associate
2008-2009
Cambridge University
2006-2010
Massachusetts Institute of Technology: Bachelor's Degree
Major in Physics (8A for you MIT nerds).

# Research

Disentangling the independently controllable factors of variation by interacting with the world
It has been postulated that a good representation is one that disentangles the underlying explanatory factors of variation. However, it remains an open question what kind of training framework could potentially achieve that. Whereas most previous work focuses on the static setting (e.g., with images), we postulate that some of the causal factors could be discovered if the learner is allowed to interact with its environment. The agent can experiment with different actions and observe their effects. More specifically, we hypothesize that some of these factors correspond to aspects of the environment which are independently controllable, i.e., that there exists a policy and a learnable feature for each such aspect of the environment, such that this policy can yield changes in that feature with minimal changes to other features that explain the statistical variations in the observed data. We propose a specific objective function to find such factors, and verify experimentally that it can indeed disentangle independently controllable aspects of the environment without any extrinsic reward signal.
Valentin Thomas∗, Emmanuel Bengio∗, William Fedus∗, Jules Pondard, Philippe Beaudoin, Hugo Larochelle, Joelle Pineau, Doina Precup, Yoshua Bengio
NIPS 2017 Workshop
Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step
Generative adversarial networks (GANs) are a family of generative models that do not minimize a single training criterion. Unlike other generative models, the data distribution is learned via a game between a generator (the generative model) and a discriminator (a teacher providing training signal) that each minimize their own cost. GANs are designed to reach a Nash equilibrium at which each player cannot reduce their cost without changing the other players' parameters. One useful approach for the theory of GANs is to show that a divergence between the training distribution and the model distribution obtains its minimum value at equilibrium. Several recent research directions have been motivated by the idea that this divergence is the primary guide for the learning process and that every step of learning should decrease the divergence. We show that this view is overly restrictive. During GAN training, the discriminator provides learning signal in situations where the gradients of the divergences between distributions would not be useful. We provide empirical counterexamples to the view of GAN training as divergence minimization. Specifically, we demonstrate that GANs are able to learn distributions in situations where the divergence minimization point of view predicts they would fail. We also show that gradient penalties motivated from the divergence minimization perspective are equally helpful when applied in other contexts in which the divergence minimization perspective does not predict they would be helpful. This contributes to a growing body of evidence that GAN training may be more usefully viewed as approaching Nash equilibria via trajectories that do not necessarily minimize a specific divergence at each step.
William Fedus*, Mihaela Rosca*, Balaji Lakshminarayanan, Andrew M. Dai, Shakir Mohamed, Ian Goodfellow
ICLR 2018
MaskGAN: Better Text Generation via Filling in the ______
Neural text generation models are often autoregressive language models or seq2seq models. Neural autoregressive and seq2seq models that generate text by sampling words sequentially, with each word conditioned on the previous word, are state-of-the-art for several machine translation and summarization benchmarks. These benchmarks are often defined by validation perplexity even though this is not a direct measure of sample quality. Language models are typically trained via maximum likelihood and most often with teacher forcing. Teacher forcing is well-suited to optimizing perplexity but can result in poor sample quality because generating text requires conditioning on sequences of words that were never observed at training time. We propose to improve sample quality using Generative Adversarial Networks (GANs), which explicitly train the generator to produce high quality samples and have shown a lot of success in image generation. GANs were originally designed to output differentiable values, so discrete language generation is challenging for them. We introduce an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context. We show, both qualitatively and quantitatively, that this produces more realistic text samples than a maximum-likelihood-trained model.
William Fedus, Ian Goodfellow, Andrew M. Dai
ICLR 2018
In-silico Labeling: Training Networks to Predict Fluorescent Labels in Unlabeled Images
Predicting fluorescent labels of microscopy images. Full paper and abstract to be released upon publication.
E. Christiansen et al.
(Accepted to) Cell Journal
Persistent Homology for Mobile Phone Data Analysis
Topological data analysis is a new approach to analyzing the structure of high dimensional datasets. Persistent homology, specifically, generalizes hierarchical clustering methods to identify significant higher dimensional properties. In this project, we analyze mobile network data from Senegal to determine whether significant topological structure is present. We investigate two independent questions: whether the introduction of the Dakar motorway has any significant impact on the topological structure of the data, and how communities can be constructed using this method. We consider three independent metrics to compute the persistent homology. In two of these metrics, we see no topological change in the data given the introduction of the motorway; in the remaining metric, we see a possible indication of topological change. The behavior of clustering using the persistent homology calculation is sensitive to the choice of metric, and is similar in one case to the communities computed using modularity maximization.
W. Fedus et al.
NetMob 2015 Challenge
Background Rejection in the DMTPC Dark Matter Search Using Charge Signals
The Dark Matter Time Projection Chamber (DMTPC) collaboration is developing low-pressure gas TPC detectors for measuring WIMP-nucleon interactions. Optical readout with CCD cameras allows for the detection of the daily modulation in the direction of the dark matter wind, while several charge readout channels allow for the measurement of additional recoil properties. In this article, we show that the addition of the charge readout analysis to the CCD allows us to obtain a statistics-limited 90% C.L. upper limit on the e^{-} rejection factor of 5.6 × 10^{-6} for recoils with energies between 40 and 200 keVee. In addition, requiring coincidence between charge signals and light in the CCD reduces CCD-specific backgrounds by more than two orders of magnitude.
J. Lopez et al.
Nuclear Instruments and Methods in Physics Research Sec A
DMTPC: Dark matter detection with directional sensitivity.
The Dark Matter Time Projection Chamber (DMTPC) experiment uses CF_4 gas at low pressure (0.1 atm) to search for the directional signature of Galactic WIMP dark matter. We describe the DMTPC apparatus and summarize recent results from a 35.7 g-day exposure surface run at MIT. After nuclear recoil cuts are applied to the data, we find 105 candidate events in the energy range 80 - 200 keV, which is consistent with the expected cosmogenic neutron background. Using this data, we obtain a limit on the spin-dependent WIMP-proton cross-section of 2.0 \times 10^{-33} cm^2 at a WIMP mass of 115 GeV/c^2. This detector is currently deployed underground at the Waste Isolation Pilot Plant in New Mexico.
J. Battat et al.
First Dark Matter Search Results from a Surface Run of the 10-L DMTPC Directional Dark Matter Detector.
The Dark Matter Time Projection Chamber (DMTPC) is a low pressure (75 Torr CF4) 10 liter detector capable of measuring the vector direction of nuclear recoils with the goal of directional dark matter detection. In this paper we present the first dark matter limit from DMTPC. In an analysis window of 80-200 keV recoil energy, based on a 35.7 g-day exposure, we set a 90% C.L. upper limit on the spin-dependent WIMP-proton cross section of 2.0 x 10^{-33} cm^{2} for 115 GeV/c^2 dark matter particle mass.
S. Ahlen et al.
Physics Letters B
The case for a directional dark matter detector and the status of current experimental efforts.
We present the case for a dark matter detector with directional sensitivity. This document was developed at the 2009 CYGNUS workshop on directional dark matter detection, and contains contributions from theorists and experimental groups in the field. We describe the need for a dark matter detector with directional sensitivity; each directional dark matter experiment presents their project's status; and we close with a feasibility study for scaling up to a one ton directional detector, which would cost around \$150M.
S. Ahlen et al.
International Journal of Modern Physics A

# Coursework

(CSE 291) Deep Neural Networks for Identity and Emotion Recognition
We train two neural networks on the POFA and NimStim datasets to identify individuals and identify emotions, respectively. In order to train these neural networks, we use two separate optimization procedures, the minFunc package and stochastic gradient descent.
William Fedus, Bobak Hashemi, Matthew Burns
(CSE 291) Efficient Encoding Using Deep Neural Networks
Deep neural networks have been used to efficiently encode high-dimensional data into low-dimensional representations. In this report, we attempt to reproduce the results of Hinton and Salakhutdinov. We use Restricted Boltzmann machines to pre-train, and standard backpropagation to fine-tune a deep neural network to show that such a network can efficiently encode images of handwritten digits. We also construct another deep autoencoder using stacked autoencoders and compare the performance of the two autoencoders.
Chaitanya Ryali, Gautam Nallamala, William Fedus and Yashodhara Prabhuzantye
(CSE 250B) Voting, Averaged and Kernelized Perceptrons
We employ various perceptron models for the classification of points.
William Fedus
(CSE 250B) Logistic Regression with Gradient Descent Optimization
Simple logistic regression on data drawn from multivariate Gaussians, optimized using gradient descent.
William Fedus
(CSE 250B) Handwritten Digit Classification via Generative Model
We use the MNIST data set to fit ten 784-dimensional multivariate Gaussians, one for each of the handwritten digits. Each grayscale handwritten digit may be considered as a 784-dimensional vector.
William Fedus
(CSE 250B) Text Classification via Multinomial Naive Bayes
Using the 20 Newsgroups data set, we train a multinomial Naive Bayes classifier to predict the class designation of a particular document composed of $n$ unique features $x_1,\dots,x_n$, where $n$ is the size of the vocabulary.
William Fedus
(CSE 250B) Nearest Neighbor Classification
We train and test our algorithm on the MNIST dataset of handwritten digits, which includes 60,000 training examples and 10,000 test examples. Each handwritten digit is encoded in a 28x28 grayscale image which can be flattened into a 784-dimensional vector, where the intensity of a particular pixel is the value along a certain dimension. In order to reduce the computational complexity of the classification, we seek a smaller set of prototypes within the full 60,000 training set.
William Fedus
(CSE 255) Link Prediction Among Suspected Terrorists
In this paper we train a logistic regression function for two forms of link prediction among a set of 244 suspected terrorists in a social network. We train and test on a dataset created at the University of Maryland and further modified at UCSD by Eric Doi and Ke Tang. The supposed terrorists have several labels for the nature of their links to other supposed terrorists; terrorists are classified as either colleagues, family, contacts, or congregates. Structural information about the known network connectivity of the supposed terrorists is integrated with additional binary information provided about the individuals to arrive at two final models. The first model predicts the existence of any type of link between two individuals and the second model classifies whether an existing link is 'colleague' or 'other'.
Alex Asplund, William Fedus
(CSE 255) Movie Sentiment Analysis
In this paper we train an L1-regularized linear support vector machine (SVM) to determine whether the sentiment of a movie review is positive or negative. We train and test on the movie review polarity dataset introduced by Pang and Lee, 2004. Classification accuracy of the linear SVM is improved through a series of experiments for various data preprocessing techniques and data transformations. Classification accuracy is found to be maximum on the 10 cross-validation folds after removing numerical entries and performing log odds weighting of terms.
Alex Asplund, William Fedus
(CSE 255) Collaborative Filtering Algorithm
In this paper we implement a collaborative filtering algorithm on the MovieLens dataset to predict movie ratings for the users. The original matrix, which contains the movie ratings on a 1-5 scale for the users, has many missing entries. Rank-K factorization is used to construct the filtering algorithm and alternating least squares is then performed on the two lower rank matrices in order to fill in the missing entries of the original matrix and thus predict all movie ratings for all users.
Alex Asplund, William Fedus
(PHYS 210A) Entropy in Classical and Quantum Information Theory
Entropy is a central concept in both classical and quantum information theory, measuring the uncertainty and the information content in the state of a physical system. This paper reviews classical information theory and then proceeds to generalizations into quantum information theory. Both Shannon and von Neumann entropy are discussed, making the connection to compressibility of a message stream and the generalization of compressibility in a quantum system. Finally, the paper considers the application of von Neumann entropy in entanglement of formation for both pure and mixed bipartite quantum states.
William Fedus
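
As a rough illustration of the entropy quantities named in the PHYS 210A entry above, here is a minimal Python sketch. It is not taken from the paper itself. The function names are hypothetical, and it assumes a pure two-qubit state written as a|00> + b|01> + c|10> + d|11>, for which the entanglement of formation equals the von Neumann entropy of either qubit's reduced density matrix.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy H(p) = -sum_i p_i log2(p_i), in bits.

    Terms with p_i = 0 contribute nothing and are skipped.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


def von_neumann_entropy(eigenvalues):
    """Von Neumann entropy S(rho) = -Tr(rho log2 rho), in bits.

    Takes the eigenvalues of the density matrix rho; S(rho) is the
    Shannon entropy of that eigenvalue spectrum.
    """
    return shannon_entropy(eigenvalues)


def entanglement_entropy_two_qubits(a, b, c, d):
    """Entropy of one qubit of the pure state a|00> + b|01> + c|10> + d|11>.

    Assumes the amplitudes are normalized. The reduced density matrix
    has trace 1 and determinant |ad - bc|^2, so its eigenvalues are
    (1 +/- sqrt(1 - 4|ad - bc|^2)) / 2.
    """
    det = abs(a * d - b * c) ** 2
    # Clamp to guard against tiny negative values from rounding.
    root = math.sqrt(max(0.0, 1.0 - 4.0 * det))
    return von_neumann_entropy([(1.0 + root) / 2.0, (1.0 - root) / 2.0])


if __name__ == "__main__":
    s = 1.0 / math.sqrt(2.0)
    print("fair coin:", shannon_entropy([0.5, 0.5]))
    print("Bell state:", entanglement_entropy_two_qubits(s, 0, 0, s))
    print("product state:", entanglement_entropy_two_qubits(1, 0, 0, 0))
```

Under these assumptions a maximally entangled Bell state should give an entropy close to one bit, and an unentangled product state should give zero. Mixed states need a minimization over pure-state decompositions, which this sketch does not attempt.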

# Random

Arxiv Sanity for managing the deluge of machine learning papers
Surf reports for some of my favorite spots around San Diego, CA: Blacks, Bird Rock and Tourmaline. Also, a brief intro to surf lingo
Conquering, arguably, one of the most dangerous bike jumps in the world
Great set of educational links, the No Excuse List
Andrej Karpathy's blog post, The Unreasonable Effectiveness of Recurrent Neural Networks, and his academic site (the clear inspiration for this site template)
Hinton's Coursera Lectures, full of excellent insights