# Learning faster from easy data II

Friday December 11, 2015
Room 511d

## Overview

In both stochastic and online learning we have a good theoretical understanding of the most difficult learning tasks through worst-case or minimax analysis, and we have algorithms to match. Yet there are commonly occurring cases that are much easier than the worst case where these methods are overly conservative, showing a large gap between the performance predicted by theory and observed in practice. Recent work has refined our theoretical understanding of the wide spectrum of easy cases, leading to the development of algorithms that are robust to the worst case, but can also automatically adapt to easier data and achieve faster rates whenever possible.

Examples of easier cases include (Tsybakov) margin conditions, low noise or variance, probabilistic Lipschitzness and empirical curvature of the loss (strong convexity, exp-concavity, mixability), as well as low-complexity decision boundaries and comparators, quantile bounds, and cases with few switches among few leaders. Adapting to such easy data often involves data-dependent bias-variance trade-offs through hyper-parameter learning, adaptive regularisation or exploration, or hypothesis testing to distinguish between easy and hard cases.

This workshop continues from the successful first Learning Faster from Easy Data workshop at NIPS 2013. The last two years have seen many exciting new developments in the form of new desirable adaptivity targets, new algorithms and new analysis techniques. In this workshop we aim to bring together researchers and practitioners interested in adaptation to easy data. The key questions we will discuss are:

• Which are the data-dependent trade-offs encountered (bias-variance or other)?
• Can we identify commonalities across different problem domains in strategies that are being used to deal with these trade-offs?
• And what is the price for adaptivity (if any)?

Both theoretical and empirical insights are welcomed.

## Call for papers

We invite the submission of abstracts to the workshop. Selected abstracts will be advertised on the workshop website and invited for a spotlight and poster presentation during the workshop.

Abstracts should be at most 4 pages long (not including references) in PDF format following the NIPS 2015 style. Submissions should not be anonymous. The organizers will review all submissions. Please send contributions by email to easydata2015@timvanerven.nl.

Submission of previously published work or work under review is allowed, including NIPS 2015 submissions, if clearly declared as such. However, preference will be given to novel work or work that has not yet been presented elsewhere (for example, recent journal publications).

## Important dates

• Submission deadline: October 10th 17th, 2015
• Notification of acceptance: October 24th 28th, 2015.
• Workshop: Friday December 11, 2015.

## Invited speakers

Peter Auer, University of Leoben Cancelled. New talk by Shai Ben-David
An optimal bandit algorithm for both stochastic and adversarial environments
9:30-10:00

We are interested in algorithms that perform well in (relatively easy) stochastic environments and also in adversarial environments. Our results, together with results of Bubeck and Slivkins quite completely characterize what such algorithms can achieve.

Bubeck and Slivkins showed that an algorithm with $$O(\log^2 n)$$ pseudo-regret in stochastic environments can achieve $$\tilde O(\sqrt{n})$$ regret in adversarial environments with high probability. We give an algorithm with $$O(\log n)$$ pseudo-regret and $$\tilde O(\sqrt{n})$$ expected regret in adversarial environments. Furthermore, we show that no algorithm with $$O(\log^a n)$$, $$a<2$$, pseudo-regret can achieve $$\tilde O(\sqrt{n})$$ regret with high probability in adversarial environments.

Our algorithm uses a weak test for non-stochastic arms, and repeats this test to achieve sufficient confidence. Our analysis capitalizes on the negative regret an adversary may allow when switching the best arm.

Shai Ben-David, Universitys of Waterloo
Clustering is Easy When . . . What? [slides]
9:30-10:00

It is well known that most of the common clustering objectives are NP-hard to optimize. In practice, however, clustering is being routinely carried out. One approach for providing theoretical understanding of this seeming discrepancy is to come up with notions of clusterability that distinguish realistically interesting input data from worst-case data sets. The hope is that there will be clustering algorithms that are provably efficient on such “clusterable” instances. This paper addresses the thesis that the computational hardness of clustering tasks goes away for inputs that one really cares about. In other words, that “Clustering is difficult only when it does not matter” (the CDNM thesis for short).

I would like to use the format of a workshop presentation to deviate from the conference paper style of focusing on cutting edge results. Instead, I wish to present a a critical bird’s eye overview of the results published on this issue so far and to call attention to the gap between available and desirable results on this issue. I start by discussing which requirements should be met in order to provide formal support to the the CDNM thesis. I then examine existing results in view of these requirements and list the most significant unsolved research challenges in that direction.

Aarti Singh, Carnegie Mellon University
Tsybakov noise adaptive margin-based active learning [slides]
10:30-11:00

Tysbakov noise condition characterizes the level of noise in labels in a classification setting, leading to faster rates for both passive and active learning when the noise is low. However, the noise level is typically unknown and it is desirable to design adaptive algorithms that are robust to high noise setting while leading to faster rates in low noise settings. In passive learning, this is straightforward and can be achieved by model selection. However, model selection for active learning is an open problem. In this talk, I will present a noise-robust margin-based active learning algorithm that can find homogeneous multi-dimensional linear separators while adapting to unknown level of noise and achieves optimal statistical rate up to polylogarithmic factors. The ideas have close connections and implications for adaptive convex optimization where one can achieve adaptivity to the degree of convexity.

Dylan Foster, Cornell University
11:00-11:30

There is a chasm between theory and practice in machine learning. Theoretically derived algorithms are often empirically outperformed by heuristic ones engineered for the specific problem. The reason for this chasm is that theoretically derived algorithms are pessimistic and are tailored from a worst case analysis. There has of course been a plethora of work on adaptive learning algorithms that obtain the best of both worlds, the robustness against worst case instances and the good performances obtained by tailor made algorithms facing easy data. However often these analysis are case by case and problem specific. In this work we attempt to develop a unifying theory for adaptive online learning which in a sense is a first step to developing a VC or PAC style theory for adaptive online learning. We propose a general framework for studying adaptive regret bounds in the online learning framework, including model selection bounds and data-dependent bounds. Given a data- or model-dependent bound we ask, "Does there exist some algorithm achieving this bound?" We show that modifications to recently introduced sequential complexity measures can be used to answer this question by providing sufficient conditions under which adaptive rates can be achieved. In particular each adaptive rate induces a set of so-called offset complexity measures, and obtaining small upper bounds on these quantities is sufficient to demonstrate achievability. A cornerstone of our analysis technique is the use of one-sided tail inequalities to bound suprema of offset random processes.

Our framework recovers and improves a wide variety of adaptive bounds including quantile bounds, second-order data-dependent bounds, and small loss bounds. In addition we derive a new type of adaptive bound for online linear optimization based on the spectral norm, as well as a new online PAC-Bayes theorem that holds for countably infinite sets.

Peter Grünwald, CWI and Leiden University
Easy Data [slides]
14:30-15:00

Optimal convergence rates in i.i.d. learning problems are determined by model complexity and a parameter indicating the inherent 'easiness' of the problem. As recently shown [1], for bounded and subgaussian losses this easiness is determined by a 'central' condition which generalizes both

1. the Bernstein condition of e.g. Bartlett and Mendelson (2006), itself a generalization of the Tsybakov margin condition to general losses and situations where the Bayes predictor is not in the class
2. stochastic mixability, itself a generalization of the stochastic-exp-concavity condition of Juditsky-Rigollet-Tsybakov (2008) for fast rates in on-line learning

Moving from stochastic to online adversarial, here we introduce a condition that expresses for an individual data sequence how 'easy' it is. We develop the individual-sequence central-condition with parameter $$\alpha$$ between $$0$$ and $$1/2$$. Under this condition, the Hedge algorithm with $$K$$ predictors and learning rate decreasing as $$T^{-\alpha}$$ achieves a regret of order $$(\log K)^{1-\alpha} T^\alpha$$, nicely interpolating between the worst-case $$\sqrt{ (\log K) T }$$ regret and the 'easy' $$(\log K)$$ regret that can be achieved if losses are i.i.d. While to use Hedge, one needs to know $$\alpha$$ in advance, the recent Squint algorithm [2] automatically adapts to the right $$\alpha$$.

(joint work with Wouter Koolen and Tom Sterkenburg)

T. van Erven, P. D. Grünwald, N. A. Mehta, M. D. Reid and R. C. Williamson. Journal of Machine Learning Research, vol. 16, pp. 1793-1861, 2015. In the special issue dedicated to the memory of Alexey Chervonenkis.
W. M. Koolen and T. van Erven. JMLR Workshop and Conference Proceedings, vol. 40: Proceedings of the 28th Conference on Learning Theory (COLT), pp. 1155-1175, 2015.

Satyen Kale, Yahoo! Labs
Optimal and Adaptive Online Boosting Algorithms [slides]
15:00-15:30

We initiate the study of boosting in the online setting, where the task is to convert a "weak" online learner into a "strong" online learner. The notions of weak and strong online learners directly generalize the corresponding notions from standard batch boosting. In this talk we focus on the classification setting, for which we develop two online boosting algorithms. The first algorithm is an online version of boost-by-majority, and we prove that it is essentially optimal in terms of the number of weak learners and the sample complexity needed to achieve a specified accuracy. The second algorithm is adaptive (in the sense that it obtains improved performance for "easy data") and parameter-free, albeit not optimal.

Gergely Neu, INRIA Lille
On adaptive regret bounds for non-stochastic bandits [slides]
17:00-17:45

In recent years, the literature on data-dependent performance bounds for full-information online learning underwent a rapid growth and by now, we have an almost complete understanding of what guarantees are achievable and what are not. However, the picture in partial-information (and, in particular, bandit) problems is much more scattered: only a handful of adaptive bounds existing for full-information settings have equivalents in the bandit case. The most surprising gap in the literature is the lack of "first-order" regret bounds for bandits, which is the most basic improvement in full-information settings, achievable by nearly all known full-information learning algorithms. In this talk, I review some existing results concerning adaptive bounds and describe a new efficient algorithm that achieves a first order bound for a general class of bandit problems.

In the second half of the talk, I describe some of the main obstacles for proving adaptive performance bounds for bandit problems. In particular, I show how the most powerful tools for proving adaptive bounds for the full information case break down in the bandit case, suggesting that the possibilities for learning faster from easy data may be much more limited in partial-information settings than in full-information problems.

## Contributed Papers

The following papers will be presented as spotlight presentations and as posters during the poster session.

### Spotlight I (11.30 - 12.00)

Cheng Tang and Claire Monteleoni. Scalable constant k-means approximation via heuristics on well-clusterable data. [ spotlight | paper ]

Shai Ben-David. Clustering is easy when . . . what?. [ paper ]

Aymeric Dieuleveut and Francis Bach. Adaptativity of stochastic gradient descent. [ spotlight | paper ]

Mehryar Mohri and Scott Yang. Accelerating optimization: Pre-empting the leader via adaptive prediction. [ spotlight | paper ]

Zhe Li, Tianbao Yang, Lijun Zhang, and Rong Jin. A two-stage approach for learning a sparse model with sharp excess risk analysis. [ spotlight | paper ]

Manfred K. Warmuth and Jiazhong Nie. Measuring the “on-lineness” of data streams. [ spotlight | paper ]

### Spotlight II (15.30 - 16.00)

Wojciech Kotłowski. Minimax strategy for prediction with expert advice under stochastic assumptions. [ spotlight | paper ]

Vu Dinh, Lam Si Tung Ho, Nguyen Viet Cuong, Duy Nguyen, and Binh T. Nguyen. Learning from non-iid data: Fast rates for the one-vs-all multiclass plug-in classifiers. [ spotlight | paper ]

Michał Dereziński and Badri Narayan Bhaskar. Anticipating concept drift in online learning. [ spotlight | paper ]

Mehryar Mohri and Scott Yang. Data-dependent algorithms for bandit convex optimization. [ spotlight | paper ]

Ruitong Huang, András György, and Csaba Szepesvári. Easy data for independent component analysis. [ spotlight | paper ]

Akshay Balsubramani and Yoav Freund. Aggregating binary classifiers with general losses. [ spotlight | paper ]

## Schedule

 From To What 9.00 9.30 Introduction by organizers [slides] 9.30 10.00 Invited talk: Shai Ben-David Clustering is Easy When . . . What? [slides] (replaces Peter Auer) 10.00 10.30 COFFEE BREAK 10.30 11.00 Invited talk: Aarti Singh Tsybakov noise adaptive margin-based active learning [slides] 11.00 11.30 Invited talk: Dylan Foster Adaptive Online Learning [slides] 11.30 12.00 Poster spotlights I 12.00 14.30 LUNCH BREAK 14.30 15.00 Invited talk: Peter Grünwald Easy Data [slides] 15.00 15.30 Invited talk: Satyen Kale Optimal and Adaptive Online Boosting Algorithms [slides] 15.30 16.00 Poster spotlights II 16.00 16.30 POSTERS + COFFEE BREAK 16.30 17.00 Poster session 17.00 17.45 Invited talk: Gergely Neu On adaptive regret bounds for non-stochastic bandits [slides] 17.45 18.30 Panel discussion

## Organizers

Tim van Erven, Universiteit Leiden, The Netherlands

Wouter M. Koolen, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands