# Schedule

8:00 AM

9:15 AM

9:30 AM

### Terlaky (Lehigh): 60 Years of Interior Point Methods: From Periphery to Glory

The basic concepts of Interior Point Methods (IPMs) were introduced by Frish in 1950’s, and further developed in the 1960’s, among others by Fiacco-McCormick (SUMT) and Dikin (Affince scaling). By the early 70’s it was concluded that, mostly due to numerical instability, IPMs most probably will not be viable algorithms for solving large scale optimization problems.

Karmarkar’s 1984 paper and the subsequent “Interior Point Revolution” fundamentally changed the landscape of optimization. IPMs become the method of choice to solve large-scale linear optimization problems, new classes of conic and convex optimization problems become efficiently solvable. The new powerful algorithmic and software tools opened new areas of applications.  In this talk we walk through the history of IPMs, highlight the scientific and computer technology advances that make the Interior Point revolution possible.

10:30 AM

11:00 AM

### Obozinski (École des Ponts - ParisTech): An SDCA-Powered Inexact Dual Augmented Lagrangian Method for Fast CRF Learning

I'll present an efficient dual augmented Lagrangian formulation to learn conditional random field (CRF) models. The algorithm, which can be interpreted as an inexact gradient method on the multiplier, does not require to perform exact inference iteratively, requires only a fixed number of stochastic clique-wise updates at each epoch to obtain a sufficiently good estimate of the gradient w.r.t. the Lagrange multipliers. We prove that the proposed algorithm enjoys global linear convergence for both the primal and the dual objective. Our experiments show that the proposed algorithm outperforms state-of-the-art baselines in terms of speed of convergence. (Joint work with Shell Xu Hu)

11:30 AM

### Jaggi (EPFL): Learning in a Distributed and Heterogeneous Environment

We discuss recent directions in optimization algorithms used for the training of machine learning systems, such as generalized linear models (regression, classification) and deep learning. For distributed optimization when using many machines, as well as for integrated compute devices with greatly varying compute and memory capacities (such as GPUs paired with regular compute nodes), we present ideas from convex optimization which help greatly accelerating training. In particular, we will employ importance sampling methods and primal-dual gap techniques.

Joint work with Celestine Dünner, Thomas Parnell, Anant Raj and Sebastian Stich.

12:00 PM

12:15 PM

1:45 PM

### Richtárik (KAUST): Stochastic Reformulations of Linear and Convex Feasibility Problems: Algorithms and Convergence Theory

We develop a family of reformulations of an arbitrary consistent linear system into a stochastic problem. The reformulations are governed by two user-defined parameters: a positive definite matrix defining a norm, and an arbitrary discrete or continuous distribution over random matrices. Our reformulation has several equivalent interpretations, allowing for researchers from various communities to leverage their domain specific insights. In particular, our reformulation can be equivalently seen as a stochastic optimization problem, stochastic linear system, stochastic fixed point problem and a probabilistic intersection problem. We prove sufficient, and necessary and sufficient conditions for the reformulation to be exact.

Further, we propose and analyze three stochastic algorithms for solving the reformulated problem---basic, parallel and accelerated methods---with global linear convergence rates. The rates can be interpreted as condition numbers of a matrix which depends on the system matrix and on the reformulation parameters. This gives rise to a new phenomenon which we call stochastic preconditioning, and which refers to the problem of finding parameters (matrix and distribution) leading to a sufficiently small condition number. Our basic method can be equivalently interpreted as stochastic gradient descent, stochastic Newton method, stochastic proximal point method, stochastic fixed point method, and stochastic projection method, with fixed stepsize (relaxation parameter), applied to the reformulations.

I will also comment on extensions of these ideas to convex feasibility. Time permitting, I may comment on the implications this work has for quasi-Newton methods, and on modified methods utilizing the heavy ball momentum.

The talk is mainly based on the following papers:

[1] Peter Richtárik and Martin Takáč , Stochastic reformulations of linear systems: algorithms and convergence theory, arXiv:1706.01108, 2017

[2] Ion Necoara, Andrei Patrascu and Peter Richtárik . Randomized projection methods for convex feasibility problems: conditioning and convergence rates, arXiv:1801.04873, 2018

Further related papers:

[3] Robert M. Gower and Peter Richtárik. Randomized iterative methods for linear systemsSIAM Journal on Matrix Analysis and Applications 36(4), 1660-1690, 2015

[4] Robert M. Gower and Peter Richtárik. Stochastic dual ascent for solving linear systems, arXiv:1512.06890, 2015

[5] Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods, arXiv:1712.09677, 2017

[6] Robert M. Gower and Peter Richtárik. Randomized quasi-Newton updates are linearly convergent matrix inversion algorithmsSIAM Journal on Matrix Analysis and Applications 38(4), 1380-1409, 2017

2:15 PM

4:00 PM

### Fercoq (Télécom ParisTech): Convergence Speed of a Primal-Dual Coordinate Descent Method

We introduce a randomized coordinate descent version of the Vu-Condat algorithm. By coordinate descent, we mean that only a subset of the coordinates of the primal and dual iterates is updated at each iteration, the other coordinates being maintained to their past value. Our method allows us to solve optimization problems with a combination of differentiable functions, constraints as well as non-separable and non-differentiable regularizers.

We show that the sequences generated by our algorithm almost surely converge to a saddle point of the problem at stake, for a wider range of parameter values than previous methods. In particular, the condition on the step-sizes depends on the coordinate-wise Lipschitz constant of the differentiable function's gradient, which is a major feature allowing classical coordinate descent to perform so well when it is applicable.

We then prove a sublinear rate of convergence in general and a linear rate of convergence if the objective enjoys strong convexity properties.

We illustrate the performances of the algorithm on a total-variation regularized least squares regression problem, support vector machine problems and distributed optimization.

4:30 PM

### Terlaky (Lehigh): A Polynomial-time Rescaled von Neumann Algorithm for Linear Feasibility Problems

The perceptron and von Neumann algorithms are known to be closely related, like duals. A deterministic rescaled version of the perceptron algorithm was proved to be polynomial by Pena and Soheil. Recently, Chubanov proposed a method which solves homogeneous linear equality systems with positive variables in polynomial time. Chubanov's method can be considered as a column-wise rescaling procedure. We adapt Chubanov's method to the von Neumann problem, and so we design a polynomial time column-wise rescaling von Neumann algorithm. This algorithm is the first variant of the von Neumann algorithm with polynomial complexity.

Joint work with Dan Li and Kees Roos

6:00 PM

6:30 PM

### Banquet Dinner at Al-Marsa Restaurant (by invitation only)

Dinner for workshop speakers, poster presenters, industrial visitors and invited guests.

8:30 AM

9:30 AM

### Scheinberg (Lehigh): Direct and Efficient Optimization of Prediction Error and AUC of Linear Classifiers

The predictive quality of most machine learning models is measured by expected prediction error or so-called Area Under the Curve (AUC). However, these functions are not used in the empirical loss minimization, because their empirical approximations are  nonconvex and discountinuous, and more importantly have zero derivative almost everywhere. Instead, other loss functions are used, such as logistic loss. In this work, we show that in the case of linear predictors, and under the assumption that the data has normal distribution, the expected error and the expected AUC are not only smooth, but have well defined derivatives, which depend on the first and second moments of the distribution. We show that these derivatives can be approximated  and used in empirical risk minimization, thus proposing a gradient-based optimization methods for direct optimization of prediction error and AUC. Moreover the proposed algorithm has no dependence on the size of the dataset, unlike logistic regression and all other well known empirical risk minimization techniques.

10:00 AM

### Ghanem (KAUST): FFTLasso: Large-Scale Lasso in the Fourier Domain

​In this talk, I will focus on FFTLasso, a new solver for large scale LASSO problems developed at the Image and Video Understanding Lab (IVUL) at KAUST. The LASSO sparse representation problem has been studied and used in a variety of different areas, ranging from signal processing and information theory to computer vision and machine learning. In the vision community, it found its way into many important applications, including face recognition, tracking, super resolution, image denoising, to name a few. Despite advances in efficient sparse algorithms, solving large-scale LASSO problems remains a challenge. To circumvent this difficulty, people tend to downsample and subsample the problem (e.g. via dimensionality reduction) to maintain a manageable sized LASSO, which usually comes at the cost of losing solution accuracy. FFTLasso is proposed as a novel circulant reformulation of the LASSO that lifts the problem to a higher dimension, where ADMM can be efficiently applied to its dual form. Because of this lifting, all optimization variables are updated using only basic element-wise operations, the most computationally expensive of which is a 1D FFT. In this way, there is no need for a linear system solver nor matrix-vector multiplication. Since all operations in FFTLasso are element-wise in the Fourier domain, the sub-problems are completely independent and can be trivially parallelized (e.g. on one or more GPUs). I will highlight the efficiency gain of FFTLasso on both synthetic and real large-scale data and showcase its use on the face recognition task.

10:30 AM

11:00 AM

### Tanner (Oxford): Sparse Non-Negative Super-Resolution: Simplified and Stabilized

Super-resolution is a technique by which one seeks to overcome the inherent accuracy of a measurement device by exploiting further information.  Applications are very broad, but in particular these methods have been used to great effect in modern microscopy methods and underpin recent Nobel prizes in chemistry.  This topic has received a renewed theoretical interest starting in approximately 2013 where notions from compressed sensing were extended to this continuous setting.  The simplest model is to consider a one dimensional discrete measure @%\mu = \sum_{j=1}^{k} \alpha_j \delta_{t_j}%@ which models @%k%@ discrete objects at unknown locations @%t_j%@ and unknown amplitudes @%\alpha_j%@ (typically with non-negative amplitudes). The measurement device can be viewed as a burring operator, where each discrete spike is instead replaced a function @%\psi(s,t_j)%@ such as a Gaussian @%\exp(-\sigma |st_j|)%@, in which case one can make measurements of the form @%y(s)=\psi(s,t)\star\mu=\sum_{j=1}^k \alpha_j \psi(s,t_j)%@.  Typically one measures @%m>2k+1%@ discrete values; that is @%y(s_i)%@ for @%i=1,\ldots, m%@. The aim is then to recover the @%2k%@ parameters @%{t_j}_{j=1}^k%@ and @%{\alpha_j}_{j=1}^k%@ from the $m$ samples and knowledge of @%\psi(s,t)%@. In this talk we extend recent results by Schiebinger, Robava, and Recht to show that the @%2k%@ parameters are uniquely determined by their @%2k+1%@ samples, and that any solution consistent with the measurement within @%\tau%@ is proportionally consistent with the original measure. This work is joint with A. Eftekhari, J. Tanner, A. Thompson, B. Toader, and H. Tyagi.

11:30 AM

### Heidrich (KAUST): Optimization and Big Data in Computational Imaging

Computational Imaging aims to develop new cameras and imaging modalities that optically encode information about the real world in such a way that it can be captured by image sensors. The resulting images represent detailed information such as scene geometry, motion of solids and liquids, multi-spectral information, or high contrast (high dynamic range), which can then be computationally decoded using inverse methods, machine learning, and numerical optimization. Computational Displays use a similar approach, but in reverse. Here, the goal is to computationally encode a target image that is then optically decoded by the display hardware for presentation to a human observer. Computational displays are capable of generating glasses-free 3D displays, high dynamic range imagery, or images and videos with spatial and/or temporal super-resolution. In this talk I will give an overview of recent advances and current challenges in rapidly expanding research area, with a specific emphasis on the utilization of optimization and big data methods.​

12:00 PM

1:45 PM

### Dutta (KAUST): Online and Batch Supervised Background Estimation via L1 Regression

We propose a surprisingly simple model for supervised video background estimation. Our model is based on L1 regression. As existing methods for L1 regression do not scale to high-resolution videos, we propose several simple and scalable methods for solving the problem, including iteratively reweighted least squares, a homotopy method, and stochastic gradient descent. We show through extensive experiments that our model and methods match or outperform the state-of-the-art online and batch methods in virtually all quantitative and qualitative measures.

2:15 PM

4:00 PM

### Hauser (Oxford): Emitter Array Tomosynthesis via Nonlinear Compressed Sensing

A new generation of 3D tomography systems is based on multiple emitters and sensors that partially convolve measurements. A successful approach to deconvolve the measurements is to use nonlinear compressed sensing models. We will discuss two different nonlinear compressed sensing models and algorithms to deconvolve the measurements in a high dimensional setting, resulting in 3D image reconstruction.

4:30 PM

### Takáč (Lehigh): SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient

In this paper, we propose a StochAstic Recursive grAdient algoritHm (SARAH), as well as its practical variant SARAH+, as a novel approach to the finite-sum minimization problems. Different from the vanilla SGD and other modern stochastic methods such as SVRG, S2GD, SAG and SAGA, SARAH admits a simple recursive framework for updating stochastic gradient estimates; when comparing to SAG/SAGA, SARAH does not require a storage of past gradients. The linear convergence rate of SARAH is proven under strong convexity assumption. We also prove a linear convergence rate (in the strongly convex case) for an inner loop of SARAH, the property that SVRG does not possess. The convergence rate for convex and non-covex case is also discussed. Numerical experiments demonstrate the efficiency of our algorithm.

6:00 PM

6:30 PM

### Dinner at Bonjourain (by invitation only)

Dinner for workshop speakers, poster presenters and industrial visitors.

8:30 AM

9:30 AM

### Kim (GA Tech): Processing a Trillion-Edge Graph on a Single Machine

Processing a one trillion-edge graph has recently been demonstrated by distributed graph engines running on clusters of tens to hundreds of nodes. In this paper, we employ a single heterogeneous machine with fast storage media (e.g., NVMe SSD) and massively parallel coprocessors (e.g., Xeon Phi) to reach similar dimensions. By fully exploiting the heterogeneous devices, we design a new graph processing engine, named Mosaic, for a single machine. We propose a new locality-optimizing, space-efficient graph representation---Hilbert-ordered tiles, and a hybrid execution model that enables vertex-centric operations in fast host processors and edge-centric operations in massively parallel coprocessors. Our evaluation shows that for smaller graphs, Mosaic consistently outperforms other state-of-the-art out-of-core engines by 3.2-58.6x and shows comparable performance to distributed graph engines. Furthermore, Mosaic can complete one iteration of the Pagerank algorithm on a trillion-edge graph in 21 minutes, outperforming a distributed disk-based engine by 9.2×.

10:00 AM

### Canini (KAUST): In-Network Computation is a Dumb Idea Whose Time Has Come

Programmable networking hardware creates new opportunities for infusing intelligence into the network. This raises a fundamental question: what kinds of computation should be delegated to the network?
To answer, we turn our attention to modern machine learning workloads. Efficiently training complex machine learning models at scale requires high performance at the infrastructure level. With large models, communication among multiple workers becomes a scalability concern due to limited bandwidth.

We propose to address this problem by redesigning communication in distributed machine learning to take advantage of programmable network data planes. Our key insight is to reduce the volume of exchanged data by performing in-network computation to aggregate the model’s parameter updates as they are being transferred. However, in-network computation tasks must be judiciously crafted to match the limitations of the network machine architecture of programmable devices. With the help of our experiments on machine learning workloads, we identify that aggregation functions raise opportunities to exploit the limited computation power of networking hardware to lessen network congestion and improve the overall application performance. Moreover, as a proof-of-concept, we propose DAIET, a system that performs in-network data aggregation. Experimental results with an initial prototype show a large data reduction ratio (86.9%-89.3%) and a similar decrease in the workers' computation time.

10:30 AM

11:00 AM

### Kalnis (KAUST): Pivoted Subgraph Isomorphism – The Optimist, the Pessimist and the Realist

​Identifying frequent subgraphs is a core operation for many analytics and machine learning algorithms on graphs. Frequent subgraph mining is computationally expensive rendering many graph analytics algorithms prohibitive for large graphs. Our group developed SmartPSI, a frequent subgraph mining algorithm based on the notion of pivoted subgraph isomorphism. Within SmartPSI we propose two radically different algorithms, called optimistic and pessimistic, each one suitable for different inputs. We also include a classifier based on machine learning, that is trained on-the-fly to decide dynamically which algorithm to execute for each part of the input graph. Finally, we implement an optimizer that generates for each algorithm a low-cost execution plan. Our experimental evaluation with large real graphs reveals that SmartPSI achieves up to 6 times performance improvement compared to the state-of-the-art distributed frequent subgraph mining system.

11:30 AM

### Kandula (Microsoft Research): Approximate Answers for Complex Parallel Queries

Despite decades of research, approximations are not widely used in data analytics platforms. To understand why, an ideal approximate analytics system has to meet at least four goals: cover a large class of queries, offer much better latency and/ or throughput, have small overhead and offer accuracy guarantees. Whether such a system exists remains an open question. In this talk, I will describe alternate approaches that (a) introduce samplers as native SQL operators including samplers that can sample before a join and a group-by and (b) extend a cost-based query optimizer so as to improve the performance of plans with samplers without changing their accuracy. These techniques are used within Microsoft and are publicly available in the Azure Data Lake Analytics platform.

12:00 PM

1:45 PM

### Ehrhardt (Cambridge): Stochastic PDHG with Arbitrary Sampling and Applications to Medical Imaging

A very popular algorithm for image processing and image reconstruction with non-differentiable priors is the primal-dual hybrid gradient (PDHG) algorithm proposed by Chambolle and Pock. In some scenarios it is beneficial to employ a stochastic version of this algorithm where not all of the dual updates are executed simultaneously. It turns out that the stochastic version has convergence rates along the same lines as the deterministic PDHG. Numerical results on clinical positron emission tomography (PET) data show a dramatic speed up by the proposed method and thereby the impact this may have on clinical applications. This is joint work with A. Chambolle, P. Markiewicz, P. Richtárik, J. Schott and C. Schoenlieb.

2:15 PM

4:00 PM

### Lan (GATech): Communication-Efficient Methods for Decentralized and Stochastic Optimization

​Stochastic gradient descent (SGD) methods have recently found wide applications in large-scale data analysis, especially in machine learning. These methods are very attractive to process online streaming data as they only scan through the dataset only once but still generate solutions with acceptable accuracy. However, it is known that classical SGDs are ineffective in processing streaming data distributed over multi-agent network systems (e.g., sensor and social networks), mainly due to the high communication costs incurred by these methods.

In this talk, we present a new class of SGDs, referred to as stochastic decentralized communication sliding methods, which can significantly reduce the aforementioned communication costs for decentralized stochastic optimization and machine learning. We show that these methods can skip inter-node communications while performing SGD iterations. As a result, these methods require a substantially smaller number of communication rounds than existing decentralized SGDs, while the total number of required stochastic (sub)gradient computations are comparable to those optimal bounds achieved by classical centralized SGD type methods.

This talk is based on a joint work with Soomin Lee and Yi Zhou.

4:30 PM

### Necoara (Bucharest Politehnica): Conditions for Linear Convergence of (Stochastic) First Order Methods

For convex optimization problems deterministic first order methods have linear convergence provided that the objective function is smooth (Lipschitz continuous gradient) and strongly convex. Moreover, under the same conditions - smoothness and strong convexity -  stochastic first order methods have sublinear convergence rates. However, in many applications (machine learning, statistics, control, signal processing) the smoothness/strong convexity conditions do not hold; but the objective function still has a special structure (e.g. composition of a strongly convex function with a linear map). In this talk we replace the smoothness/strong convexity assumptions with several other conditions, that are less conservative, for which we prove that several (stochastic) first order methods are converging linearly.  We also provide necessary conditions for linear convergence of (stochastic) gradient method.  Finally, we provide examples of several functional classes satisfying our new conditions and discuss several applications of these results (Lasso problem, linear systems, linear programming, convex feasibility, etc).

5:15 PM

5:30 PM

5:30 PM

### Dinner at KAUST South Beach (by invitation only)

Dinner for workshop speakers, poster presenters and industrial visitors.

6:00 PM