The Fifth Workshop on Information Theoretic Methods in Science and Engineering (WITMSE) will be held in Amsterdam, Netherlands, on 27 August – 30 August, 2012, in the beautiful 17th century Trippenhuis. The workshop will cover hot topics at the intersection of statistics, information theory, machine learning, and their applications. The technical program will include plenary lectures and invited talks. Participation in the workshop is by invitation only.
Important dates
Abstract submission deadline | 12 August 2012
---|---
Workshop dates | 27–30 August 2012
Travel and Location
Workshop venue
The WITMSE 2012 workshop will be held at the Trippenhuis, a beautiful 17th century building in the centre of Amsterdam and the seat of the Royal Netherlands Academy of Arts and Sciences (KNAW). Click here for details on how to reach the Trippenhuis. All WITMSE sessions will take place in the Old Meeting Room.
How to get to Amsterdam?
The easiest way into Amsterdam is by train to Amsterdam Central Station. The Trippenhuis is a 10-minute walk from there. From Schiphol Airport, trains to the Central Station leave at least once every 10 minutes. You can use the travel planner of the Dutch Railways (NS) for more information. If you come to Amsterdam by car, be advised that it is a bicycle-friendly but very car-unfriendly place. You will need to park your car at a Park & Ride on the edge of the city to avoid high parking fees.
Restaurants, bars, things to do
The workshop venue is very close to Nieuwmarkt and Zeedijk; exploring the city centre from there, you will easily find many suitable bars and restaurants, in particular ones offering Oriental food, but there are many other options as well.
The map below lists some possibilities.
If you are looking for more tips, including an explanation of the public transportation system, this guide is pretty thorough.
Plenary Speakers
Peter Grünwald (Centrum Wiskunde & Informatica, Amsterdam, Netherlands) Safe MDL: learning the learning rate via empirical mixability
MDL inference can behave badly if the model under consideration is wrong: in some simple settings, one of our models is a reasonably good approximation to the data-generating machinery, but we keep selecting a very bad model forever. We introduce a test, inspired by Vovk's mixability concept, that can tell from the data whether we are heading for such a situation. If we are, we can adjust the learning rate (equivalently: make the codelength for the hypothesis larger) in a data-dependent way. The resulting "safe" estimator continues to achieve good rates with wrong models. For example, when applied to a union of classification models, the safe estimator achieves the optimal rate for the Tsybakov exponent of the underlying distribution. This establishes a connection between Rissanen's MDL (based on probability distributions viewed as codes) and Vapnik's statistical learning theory (based on predictor models such as classifiers).
Time: 15:20-16:20, Monday, August 27
Dominik Janzing (Max Planck Institute, Tübingen, Germany) Connecting description length to causality: The algorithmic Markov condition
Causal inference is usually concerned with exploring causal relations among random variables after observing sufficiently many samples drawn from the joint probability distribution. The formal basis is the Causal Markov Condition, stating that every variable is conditionally statistically independent of its non-effects, given its direct causes.
However, causal conclusions in everyday life are usually not based on statistical data analysis. Instead, we even infer causal links among single objects: significant similarities between two texts or images, for instance, indicate that one author or artist has copied from the other, or that both have been influenced by common third-party material. In such cases it is not clear how to interpret the observed similarities as statistical dependences of appropriate random variables. Such non-statistical dependences can be quantified by algorithmic mutual information, i.e., the difference between the sum of the individual compression lengths and the joint compression length. We have postulated and justified an algorithmic version of the Causal Markov Condition in which conditional dependences, too, are based on compression lengths, i.e., Kolmogorov complexities [1]. Apart from being a formal basis for causal inference among single objects, this new postulate also implies novel statistical inference rules, because probability distributions also contain non-statistical (algorithmic) information that may tell us the causal direction [2,3].
Literature:
[1] D. Janzing and B. Schölkopf: Causal inference using the algorithmic Markov condition, IEEE TIT 2010.
[2] D. Janzing and B. Steudel: Justifying additive-noise-based causal discovery via algorithmic information theory, OSID 2010.
[3] J. Lemeire and D. Janzing: Replacing causal faithfulness with the algorithmic independence of conditionals, Minds & Machines 2012.
Time: 9:00-10:00, Tuesday, August 28
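Janzing's abstract quantifies non-statistical dependence as the difference between the sum of the individual compression lengths and the joint compression length. As a rough practical illustration (our own sketch, not the authors' code; zlib stands in for an ideal compressor and the function names are ours), this quantity can be estimated directly:

```python
# Illustration only (not the authors' code): approximating algorithmic mutual
# information I(x:y) ~ C(x) + C(y) - C(xy), with zlib's compressed length
# standing in for the (uncomputable) Kolmogorov complexity C.
import zlib

def clen(data: bytes) -> int:
    """Compressed length in bytes: a crude, computable proxy for C(data)."""
    return len(zlib.compress(data, 9))

def approx_mutual_info(x: bytes, y: bytes) -> int:
    # Structure shared by x and y compresses away when they are concatenated,
    # so related objects yield a large positive value.
    return clen(x) + clen(y) - clen(x + y)

original = b"to be or not to be, that is the question " * 30
copy_with_edits = original.replace(b"question", b"answer")
unrelated = b"the rain in spain stays mainly in the plain " * 30

# A lightly edited copy shares far more algorithmic information with its
# source than an unrelated text of similar length does.
print(approx_mutual_info(original, copy_with_edits),
      approx_mutual_info(original, unrelated))
```

For genuinely related objects (a text and a lightly edited copy) the estimate is large; for unrelated texts of similar length it stays near zero, which is the intuition behind compression-based inference about copying and common sources.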
Rui M. Castro (Eindhoven University of Technology, Netherlands) Learning to Learn: Adaptive sensing for sparse signal inference
Traditional approaches to statistical inference and machine learning often rely on passive data collection methodologies. However, in many practical scenarios it is possible to adjust the data collection process based on information gleaned from previous observations, in the spirit of the "twenty-questions" game, in what is known as adaptive sensing, active learning, or sequential design of experiments. This means the data collection process is sequential, and the collection of future data is guided by all data collected so far. This is in contrast with the majority of traditional approaches, where all the data is collected passively before any analysis is done. Despite the potential to dramatically improve inference performance, the complicated data dependencies created by the closed-loop observation process, together with measurement uncertainty and noise, create numerous analytic challenges. In this talk I consider in particular a scenario of detection and estimation of high-dimensional sparse signals in noise, and show an adaptive sensing procedure that provably outperforms the best possible inference methods based on non-adaptive sensing, allowing for detection and estimation of extremely weak signals that are imperceptible without the use of adaptive sensing. Remarkably, the minimum signal magnitude necessary to guarantee accurate detection or estimation does not scale with the extrinsic dimension of the signal, and the proposed procedures are nearly optimal, as it is possible to characterize the fundamental performance limits of any adaptive sensing procedure in the settings considered.
Time: 9:00-10:00, Wednesday, August 29
Outline of the programme
The WITMSE 2012 workshop will be held at the Trippenhuis (click here for directions). All sessions will take place in the Old Meeting Room.
Day | Time | Programme
---|---|---
Monday, Aug 27 | 15:00-16:20 | Opening and plenary talk #1
Monday, Aug 27 | 16:20-18:00 | Reception
Tuesday, Aug 28 | 9:00-17:00 | Plenary talk #2 and sessions
Tuesday, Aug 28 | 19:00-21:00 | Workshop dinner at Kantjil & de Tijger
Wednesday, Aug 29 | 9:00-17:00 | Plenary talk #3 and sessions
Thursday, Aug 30 | 9:00-12:10 | Sessions
Technical programme
Monday, August 27
15:00-15:20 Opening of the Workshop
15:20-16:20 Plenary Talk: Peter Grünwald
16:20-18:00 Reception
Tuesday, August 28
9:00-10:00 Plenary Talk: Dominik Janzing
10:00-10:30 Coffee Break
10:30-11:50 Session 1 (chair: Peter Grünwald)
11:50-13:30 Lunch
- Oliver Johnson: Thinning and concavity of entropy [abstract]
I will describe how Renyi's thinning operation acts on integer-valued random variables in a similar way to scaling real-valued variables. As a result, it is possible to formulate versions of the entropy power inequality and transportation inequality for integer-valued variables, and perhaps understand the group testing problem as an analogue of compressive sensing.
- Yuri Kalnishkan, Michael Vyugin, Vladimir Vovk: Asymptotic complexity and generalised entropy [abstract]
The talk will explore connections between asymptotic complexity and generalised entropy. Asymptotic complexity of a language (a language is a set of finite or infinite strings) is a way of formalising the complexity of predicting the next element in a sequence: it is the loss per element of a strategy asymptotically optimal for that language. Generalised entropy extends Shannon entropy to arbitrary loss functions; it is the optimal expected loss given a distribution on possible outcomes. It turns out that the set of tuples of asymptotic complexities of a language w.r.t. different loss functions can be described by means of generalised entropies corresponding to the loss functions.
- Daniil Ryabko: Asymptotic statistics of stationary ergodic time series [abstract]
It is shown how to construct asymptotically consistent efficient algorithms for various statistical problems concerning stationary ergodic time series. The considered problems include clustering, hypothesis testing, change-point estimation and others. The presented approach is based on empirical estimates of the distributional distance. Some open problems and challenges are also presented.
- Peter Harremoës: Central inequality theorems [abstract]
A sum of random variables can be approximated by a Gaussian for values near the mean, while tail probabilities can be estimated via large deviation theory. For exponential families, one can use information divergence to define signed loglikelihood statistics that transform many distributions into distributions that are very close to Gaussian for both small and large deviations. In special cases we can even prove "Central Inequality Theorems" for these variables, and such inequalities have central limit theorems as corollaries.
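Oliver Johnson's talk above treats Rényi's thinning operation as the integer-valued analogue of scaling a real-valued variable. A minimal sketch of binomial thinning (our own illustration; the names are hypothetical):

```python
# Illustration only: Renyi's binomial thinning T_alpha, the integer-valued
# analogue of scaling a real-valued random variable by alpha.
import random

def thin(x: int, alpha: float, rng: random.Random) -> int:
    """Keep each of the x unit 'points' independently with probability alpha."""
    return sum(rng.random() < alpha for _ in range(x))

rng = random.Random(0)
xs = [rng.randint(0, 20) for _ in range(5000)]
ys = [thin(x, 0.5, rng) for x in xs]

# E[T_alpha X] = alpha * E[X], mirroring how scaling rescales the mean;
# thinning a Poisson(m) variable gives Poisson(alpha * m).
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
```

Thinning preserves the analogy with scaling: the mean is multiplied by alpha, and the Poisson family is closed under thinning just as the Gaussian family is closed under scaling, which is what makes the discrete entropy power inequalities in the talk plausible.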
13:30-15:10 Session 2 (chair: Yuri Kalnishkan)
15:10-15:40 Coffee Break
- Erkki P. Liski, Antti Liski: Penalized least squares model averaging [abstract]
In model selection one attempts to use the data to find a single "winning" model, according to a given criterion, whereas with model averaging (MA) one seeks a smooth compromise across a set of competing models. Most existing MA methods are based on estimation of all model weights, using exponential Akaike information criterion (AIC) or Bayesian information criterion (BIC) weights, for example. A common challenge for a regression analyst is the selection of the best subset from a large set of predictor variables. The number of candidate models may then become huge, and any approach based on estimation of all single weights becomes computationally infeasible. Our approach is to convert estimation of model weights into estimation of shrinkage factors, with trivial computational burden. We define the class of shrinkage estimators in view of MA and show that the estimators can be constructed using penalized least squares (LS) estimation with appropriate restrictions on the penalty function. The relationship between shrinkage and parameter penalization provides tools to build computationally efficient MA estimators which are easy to implement in practice. We also display the performance of the estimators in simulation experiments based on a realistic set-up with real data.
- Alberto Giovanni Busetto, Morteza Haghir Chehreghani, Joachim M. Buhmann: Approximation Set Coding for information theoretic model validation [abstract]
Models are mathematical tools aimed at prediction. In the context of model selection, a fundamental question arises: which model best generalizes the available data? We discuss the central ideas of a recently introduced principle for model validation: Approximation Set Coding (ASC). ASC is based on information theory and it is inspired by concepts of statistical physics. It designs a code based on sets of solutions called approximation sets in the context of a communication scenario. We compare the application of ASC to clustering and learning Boolean functions. We highlight the generality of the approach by identifying the fundamental similarities between the tasks and their conclusions. Experimental results are discussed in the biological application domain.
- Thijs van Ommen: Adapting AIC to conditional model selection [abstract]
In statistical settings such as regression and time series, we can condition on observed information when predicting the data of interest. For example, a regression model explains the dependent variables y_1, ..., y_n in terms of the independent variables x_1, ..., x_n. When we ask such a model to predict the value of y_{n+1} corresponding to some given value of x_{n+1}, that prediction's accuracy will vary with x_{n+1}. Existing methods for model selection do not take this variability into account, which often causes them to select inferior models. One widely used method for model selection is AIC (Akaike's Information Criterion), which is based on estimates of the KL divergence from the true distribution to each model. We propose an adaptation of AIC that takes the observed information into account when estimating the KL divergence, thereby getting rid of a bias in AIC's estimate.
- Steven de Rooij: Loss Rank [abstract]
Marcus Hutter and Minh-Ngoc Tran's Loss Rank principle is an interesting alternative to regularized model selection criteria such as MDL, penalized ML and Bayes factors. This talk reinterprets loss rank as a special case of MDL, but based on a new, very robust universal code. The loss rank code has a number of advantages over Normalized Maximum Likelihood, which is the norm in MDL learning as it minimizes the worst-case regret.
- Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, Shengjun Pan, Ananda Suresh: Competitive information processing [abstract]
A classifier associates a test sequence with the one of two training sequences that was generated by the same distribution. A closeness test determines whether two sequences were generated by the same or by different distributions. For both problems all natural algorithms are symmetric: they make the same decision under all symbol relabelings. With no assumptions on the distributions' support size or relative distance, we construct a classifier and closeness test that require at most n^(3/2) samples to attain the n-sample accuracy of the best symmetric classifier or closeness test designed with knowledge of the underlying distributions. Both algorithms run in time linear in the number of samples. Conversely, we also show that for any classifier or closeness test, there are distributions that require at least n^(7/6) samples to achieve the n-sample accuracy of the best symmetric algorithm that knows the underlying distributions.
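As background to van Ommen's talk on adapting AIC, the classical criterion itself is simple to compute: for a Gaussian model with k parameters and residual sum of squares RSS, AIC equals n·log(RSS/n) + 2k up to an additive constant. A toy comparison (our own example, not the talk's method):

```python
# Toy example (ours, not the talk's method): comparing a line against an
# intercept-only model with AIC = n*log(RSS/n) + 2k (constant terms dropped).
import math

def aic(residuals, k):
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * k

xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 + 0.1 * math.sin(7 * x) for x in xs]   # near-linear data

def line_residuals(xs, ys):
    """Ordinary least-squares line fit; returns the residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return [y - (a + b * x) for x, y in zip(xs, ys)]

res_line = line_residuals(xs, ys)                # k = 3: intercept, slope, variance
res_mean = [y - sum(ys) / len(ys) for y in ys]   # k = 2: mean, variance

# The model with the smaller AIC is preferred.
print(aic(res_line, 3), aic(res_mean, 2))
```

Here the line wins comfortably; the issue raised in the talk is that this single number ignores how the prediction's accuracy varies with the observed x_{n+1}.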
15:40-17:00 Session 3 (chair: Oliver Johnson)
19:00-21:00 Workshop dinner
- Paul Vitányi: Rate distortion and denoising of individual data using Kolmogorov complexity [abstract]
Using Kolmogorov complexity theory, we fill a gap left by Shannon's theory of lossy compression (rate distortion theory). It turns out that individual objects commonly have rate distortion characteristics that are radically different from the expected ones supplied by Shannon's approach. We relate low algorithmic mutual information to low Kolmogorov complexity (and consequently suggest that certain aspects of the mutual information formulation of Shannon's rate-distortion function behave differently than would an analogous formulation using algorithmic mutual information); we explore the notion that strings of low Kolmogorov complexity, among the strings having equal distortion with respect to the word to be transmitted, capture the interesting properties of the latter (which is hard to formalize in Shannon's theory), and this suggests an approach to denoising.
To make the theory practical, we have implemented the Kolmogorov complexity approach using the code-length output of real-world lossless compressors. This is joint work with N.K. Vereshchagin and S. de Rooij:
N.K. Vereshchagin and P.M.B. Vitanyi, Rate distortion and denoising of individual data using Kolmogorov complexity, IEEE Trans. Information Theory, 56:7(2010), 3438-3454.
S. de Rooij, P. Vitanyi, Approximating rate-distortion graphs of individual data: Experiments in lossy compression and denoising, IEEE Trans. Comp., 61:3(2012), 395-407.
- Flemming Topsøe: Beyond Shannon with three examples from geometry, statistics and information theory [abstract]
In previous contributions to WITMSE, an abstract theory of cognition inspired by information theory, but going beyond classical Shannon theory in certain respects, was outlined. Here, we continue the work by presenting three concrete problems. Sylvester's problem from geometric location theory, a problem of universal coding from information theory and the problem of isotone regression from statistics. At first, we focus on non-technical, philosophically oriented considerations. A more complete analysis of isotone regression follows and finally we point out a surprising connection between this problem and the one from universal coding.
- Jan Lemeire: As the atypical case may be common, what about the performance of algorithms? [abstract]
It is well known that a distinction should be made between typical and atypical, structured input to algorithms, especially when studying their performance. For typical data, quicksort runs in n log n time, while for atypical data the performance can degrade to n^2. Under a uniform prior probability over the input space, the occurrence of structured input is negligible. However, performance assessment based on random sampling is questionable, since the assumption of a uniform prior is wrong. Structure deserves a higher prior; structure is an important ingredient of nature.
In contrast to the uniform prior, Solomonoff's universal prior attributes a higher prior to simple structures. These considerations are also important with respect to the no free lunch theorems. But, can we offer a generic method for taking structure into account? Do we have a way of modeling, generating or identifying 'structure'? We will discuss these issues with some examples.
- Boris Ryabko: An information-theoretic method for estimating the performance of computer systems [abstract]
We consider a notion of computer capacity as a novel approach to evaluation of computer performance. Computer capacity is based on the number of different tasks that can be executed in a given time. This characteristic does not depend on any particular task and is determined only by the computer architecture. It can be easily computed at the design stage and used for optimizing architectural decisions.
Wednesday, August 29
09:00-10:00 Plenary Talk: Rui M. Castro
10:00-10:30 Coffee Break
10:30-11:50 Session 4 (chair: Daniil Ryabko)
11:50-13:30 Lunch Break
- Ugo Vespier, Arno Knobbe, Siegfried Nijssen, Joaquin Vanschoren: MDL-Based identification of relevant temporal scales in time series [abstract]
The behavior of many complex physical systems is affected by a variety of phenomena occurring at different temporal scales. Time series data produced by measuring properties of such systems often mirrors this fact by appearing as a composition of signals across different time scales. When the final goal of the analysis is to model the individual phenomena affecting a system, it is crucial to be able to recognize the right temporal scales and to separate the individual components of the data. We introduce a solution to this challenge based on a combination of the Minimum Description Length (MDL) principle, feature selection strategies, and convolution techniques from the signal processing field. As a result, we show that our algorithm produces a good decomposition of a given time series and, as a side effect, builds a compact representation of its identified components. Experiments demonstrate that our method manages to identify correctly both the number and the temporal scale of the components for real-world as well as artificial data and show the usefulness of our method as an exploratory tool for analyzing time series data.
- Ralf Eggeling, Teemu Roos, Petri Myllymäki, Ivo Grosse: Comparison of NML and Bayesian scoring criteria for learning parsimonious Markov models [abstract]
Parsimonious Markov models, a generalization of variable order Markov models, have been recently introduced for modeling biological sequences. Up to now, they have been learned by Bayesian approaches. However, reliable prior knowledge is not always available, and an uninformative prior cannot be defined easily. Hence we adapt NML-inspired criteria that have been proposed for structure and parameter learning of Bayesian networks to parsimonious Markov models. We empirically compare the performance of the NML approach to that of the Bayesian approach for classifying splice sites, which is a central problem in computational biology.
- Jesús E. García, Verónica A. González-López, M. L. L. Viola: Model selection for multivariate stochastic processes [abstract]
We address the problem of model selection for a multivariate source with finite alphabet. A new class of multivariate Markov models and a model selection algorithm for this class of models are introduced. For Markovian sources the model selection procedure is consistent in the sense that, eventually, as the collected data grows, the source's Markov model will be retrieved exactly and described with a minimal number of parameters. This work extends and generalizes previous results on minimal Markov models and context tree models.
- Jesús E. García, Verónica Andrea González-López, M. L. L. Viola: Robust model selection for stochastic processes [abstract]
In this paper we address the problem of model selection for the set of finite memory stochastic processes with finite alphabet, when the data is contaminated. We consider m independent samples, with more than half of them being realizations of the same stochastic process with law Q, which is the one we want to retrieve. We devise a model selection procedure such that, for a sample size large enough, the selected process is the one with law Q. Our model selection strategy is based on estimating relative entropies to select a subset of samples that are realizations of the same law. Although the procedure is valid for any family of finite order Markov models, we will focus on the family of variable length Markov chain models, which includes the fixed order Markov chain model family. We define the asymptotic breakdown point (ABDP) for a model selection procedure, and we derive the ABDP for our procedure. This means that if the proportion of contaminated samples is smaller than the ABDP, then, as the sample size grows, our procedure selects a model for the process with law Q. We also use our procedure in a setting where we have one sample formed by the concatenation of sub-samples of two or more stochastic processes, with most of the sub-samples having law Q.
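Several talks in this session select models by minimum description length. A minimal two-part MDL sketch for choosing the order of a binary Markov chain (our own illustration, not the authors' algorithms): the score is L(data | model), the maximum-likelihood code length, plus a model cost of half a log per free parameter.

```python
# Illustration only (not the authors' algorithms): two-part MDL selection of
# the order of a binary Markov chain. Score = maximum-likelihood code length
# of the data plus 0.5 * log2(n) bits per free parameter.
import math
import random
from collections import Counter

def mdl_markov(seq: str, k: int) -> float:
    """Two-part code length, in bits, of seq under an order-k Markov model.
    (The first k symbols are not encoded; negligible for long sequences.)"""
    n = len(seq)
    ctx_next = Counter((seq[i - k:i], seq[i]) for i in range(k, n))
    ctx = Counter(seq[i - k:i] for i in range(k, n))
    data_bits = -sum(c * math.log2(c / ctx[cx]) for (cx, _), c in ctx_next.items())
    n_params = 2 ** k                      # one free probability per context
    model_bits = 0.5 * n_params * math.log2(n)
    return data_bits + model_bits

# A genuinely order-1 source: each symbol repeats the previous one w.p. 0.9.
rng = random.Random(0)
s = "0"
for _ in range(2000):
    s += s[-1] if rng.random() < 0.9 else ("1" if s[-1] == "0" else "0")

scores = {k: mdl_markov(s, k) for k in range(4)}
best = min(scores, key=scores.get)
```

Lower orders underfit and pay in data bits, while higher orders pay more in model bits than their extra contexts save, which is the trade-off the consistency results in this session formalize.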
13:30-15:10 Session 5 (chair: Peter Harremoës)
15:10-15:40 Coffee Break
- Łukasz Dębowski: Information-theoretic models of natural language [abstract]
In 1990, German telecommunications engineer Wolfgang Hilberg conjectured that the mutual information between two adjacent blocks of text in natural language grows according to the power law as a function of the block length. Hilberg's conjecture was based on an extrapolation of Shannon's (1951) seminal experimental data. In this talk, we will review some previous results of ours that concern two issues connected to Hilberg's conjecture. First, we will exhibit a class of stochastic processes, called Santa Fe processes, which are motivated linguistically and for which mutual information grows according to the power law. Second, we will demonstrate that a power law growth of mutual information implies a power law growth of vocabulary. The latter statement is observed for texts in natural language and called Herdan's law.
- Zhanna Reznikova, Boris Ryabko: Application of information theory for studying numerical competence in animals: an insight from ants [abstract]
Our long-term experimental study of ant "language" and intelligence, based entirely on ideas from information theory, revealed a symbolic language in highly social ant species and demonstrated that these insects are able to transfer to each other information about the number of objects, and can even add and subtract small numbers in order to optimize their messages. We suggest that applying ideas from information theory can open new horizons for studying numerical competence in non-human animals.
- Mervi Eerola, Satu Helske: Analysing life history calendar data: a methodological comparison [abstract]
The life history calendar, also called an event-history calendar, is a data-collection tool for obtaining reliable retrospective data about life events. The advantage of a life history calendar is that the order and proximity of important transitions in multiple life domains can be studied at the same time.
To illustrate the analysis of such data, we compare model-based probabilistic event history analysis with a more recent, model-free data-mining approach, sequence analysis. The latter is well known in bioinformatics in the analysis of protein or DNA sequences. In life course analysis it is less familiar, but it has provided novel insight into the diversity of life trajectories and their relationship to life satisfaction. We emphasize the differences, but also the complementary advantages, of the methods.
In event history analysis, we consider the data generated by a marked point process (T_n , X_n )_{n >= 1}, a time-ordered sequence of points or events, characterised by pairs of random variables, the occurrence times T_1,T_2,... and marks X_1,X_2,... describing what happens at a particular T. Instead of transition hazards, we estimate the cumulative prediction probabilities of a particular life event in the entire observed trajectory, given the history of the marked point process. This way of combining information in multi-state event history models has been called `survival synthesis'. The innovation gain from observing a life event at a particular age, related to the prediction of another life event, can be quantified and monitored visually.
In sequence analysis, we compare several dissimilarity measures between the life sequences, either assuming independence or using some ad hoc definition of dependence between the sequence elements. We also contrast data-driven (estimated) and user-defined costs of substituting one sequence element with another.
As an example, we study young adults' transition to adulthood as a sequence of events in three life domains (partnership, parenthood and employment). The events define the multi-state event history model and the parallel life domains in the multidimensional sequence analysis.
We conclude that the two approaches complement each other in life course analysis; sequence analysis can effectively find typical and atypical life patterns while event history analysis is needed for causal inquiries.
- Jilles Vreeken, Nikolaj Tatti: Summarising Event Sequences with Serial Episodes [abstract]
Pattern mining is a branch of data mining concerned with finding informative local associations in data. A well-known example is frequent itemset mining: finding those sets of products that were sold together at least a given number of times. The key problem in traditional pattern mining is the pattern explosion: while tight constraints reveal only common knowledge, looser constraints lead to enormous numbers of patterns that are impossible to use or analyse.
The root of this problem is that pattern mining was not meant for data description, that is, redundancy was not penalized. To this end, we take an information theoretic approach, and instead aim for small and non-redundant sets of patterns that together describe the data well. In this talk, I will discuss SQS, a new MDL-based approach for finding good sets of serial episodes (sub-sequences allowing gaps) directly from data. Experiments show SQS efficiently discovers easily interpretable models that describe the data well.
- Hannes Wettig, Javad Nouri, Kirill Reshetnikov, Roman Yangarber: Information-theoretic methods for analysis and inference in etymology [abstract]
We introduce a family of minimum description length models which explicitly utilizes phonetic features and captures long-range contextual rules that condition recurrent correspondences of sounds within a language family. We also provide an algorithm to learn a model from this family given a corpus of cognates, sets of genetically related words. Finally, we present an imputation procedure which allows us to compare the quality of alignment models, as well as the goodness of the data sets. Our evaluations demonstrate that the new model yields improvements in performance compared to those previously reported in the literature.
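Dębowski's abstract above relates the power-law growth of mutual information to Herdan's law, the power-law growth of vocabulary. The latter is easy to observe even on synthetic Zipfian text (our own toy illustration; names are ours):

```python
# Toy illustration of Herdan's (Heaps') law: the number of distinct word
# types grows like a power of the number of tokens read.
import random

def vocab_sizes(tokens, checkpoints):
    """Distinct-type count after n tokens, for each checkpoint n."""
    seen, out = set(), {}
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i in checkpoints:
            out[i] = len(seen)
    return out

# Synthetic corpus with Zipfian frequencies: word i drawn with weight 1/i.
rng = random.Random(1)
words = [f"w{i}" for i in range(1, 2001)]
weights = [1.0 / i for i in range(1, 2001)]
text = rng.choices(words, weights=weights, k=20000)

sizes = vocab_sizes(text, {1000, 10000, 20000})
```

The type counts grow markedly sublinearly in the number of tokens, the qualitative signature of Herdan's law; on real corpora, a log-log regression of vocabulary size on text length recovers the power-law exponent.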
15:40-17:00 Session 6 (chair: Teemu Roos)
- Vladimir Vovk: Informational and computational efficiency of set predictors [abstract]
There are two methods of set prediction that are provably valid under the assumption of randomness: transductive conformal prediction and inductive conformal prediction. The former method is informationally efficient but often lacks computational efficiency. The latter method is, vice versa, computationally efficient but less efficient informationally. This talk discusses a new method, which we call cross-conformal prediction, that combines informational efficiency of transductive conformal prediction with computational efficiency of inductive conformal prediction. The downside of the new method is that its validity is an empirical rather than mathematical fact.
- Kenji Yamanishi, Ei-ichi Sakurai, Hiroki Kanazawa: Change detection, hypothesis testing and data compression [abstract]
We are concerned with the issue of detecting changes in statistical models as they change over time. We introduce the dynamic model selection (DMS) algorithm for learning model sequences on the basis of the minimum description length (MDL) principle. We first analyze it from the viewpoint of hypothesis testing. We evaluate error probabilities for testing the occurrences of change-points and relate them to the model transition estimators and the distance between the models to be distinguished. We then apply the DMS algorithm to data compression via piecewise stationary memoryless sources (PSMSs). We give a method for discretizing the parameter space to obtain an optimal data compression bound. From both the viewpoints of hypothesis testing and data compression, we argue how to discretize the parameter space in order to obtain ideal performance. This yields a new view of the distinguishability of probabilistic models from the standpoint of change detection.
- David R. Bickel: Information-theoretic probability combination with applications to reconciling statistical methods [abstract]
This talk proposes both a novel solution to the problem of combining probability distributions and a framework for using the new method to combine the results of differing statistical methods that may legitimately be used to analyze the same data set. While the talk emphasizes theoretical development, it is motivated by the need to combine two conflicting estimators of the probability of differential gene expression.
- Anjali Mazumder, Steffen Lauritzen: Information-Theoretic Value of Evidence Analysis Using Probabilistic Expert Systems [abstract]
The evaluation and interpretation of evidence is often made under uncertainty, where the task of reasoning involves estimating unknown quantities from some given observations. There is often a quest for data to reduce uncertainty. Forensic scientists are often called upon to give expert testimony in a court of law or public enquiry, e.g. on the source of a DNA sample. Their evaluation and interpretation of the evidence is often under scrutiny, and they are often asked to justify their decision-making process. The task of decision-making, evaluating, and interpreting the evidence is further tested when there are multiple sources of evidence which may or may not relate to the same query. Information is seldom cost-free, and therefore there is a need to evaluate beforehand whether it is worthwhile to acquire, and to decide which sources of information to consult in order to optimise the desired goal (i.e. the reduction in uncertainty about the inference).
Using information-theoretic concepts, Lauritzen and Mazumder (2008) defined a value of evidence (VOE) criterion Iq as a general measure of informativeness for any forensic query Q and collection of evidence X_1,...,X_K where the probability distribution of the query (given evidence) is of interest. When there are multiple sources of information, a decision-theoretic framework provides a systematic approach to deciding which test(s) to perform that contribute most to reducing uncertainty about the query. A probabilistic network formulation provides an attractive platform for the graph-theoretic representation of the VOE problem and eases the laborious calculation of the marginal and conditional probabilities of interest. When the configuration space is too large for exact computation and exhaustive search, Monte Carlo sampling methods are employed.
The VOE criterion Iq, having a solid theoretical basis, has been directly applied to a variety of planning problems in forensic genetics to determine the quantity and choice of individuals and genetic markers to type in order to gain sufficient information for evaluation and interpretation (Mazumder, 2010). This approach is extended to other complex evidential-reasoning cases involving multiple evidence types, in which the modular graph structure and conditional independence properties are exploited to aid the decision-making and reasoning process. This research aims to contribute in three ways: (1) developing computational methods for VOE analysis using probabilistic expert systems (PESs), (2) developing a decision-theoretic framework for planning and inference in the evaluation of complex evidence structures, and (3) advancing the methods for evaluating and interpreting forensic evidence.
References:
Lauritzen, S. and Mazumder, A. (2008). Informativeness of genetic markers for forensic inference - an information-theoretic approach. Forensic Science International: Genetics Supplement Series, 1:652-653.
Mazumder, A. (2010). Planning in Forensic DNA Identification Using Probabilistic Expert Systems, Doctoral dissertation, University of Oxford, Oxford.
Thursday, August 30
09:00-10:20 Session 7 (chair: Volodya Vovk)
10:20-10:50 Coffee Break
- Wouter Koolen, Dimitri Adamskiy, Manfred K. Warmuth: Putting Bayes to sleep [abstract]
Consider sequential prediction algorithms that are given the predictions from a set of models as inputs. If the nature of the data is changing over time in that different models predict well on different segments of the data, then adaptivity is typically achieved by mixing into the weights in each round a bit of the initial prior (kind of like a weak restart). However, what if the favored models in each segment are from a small subset, i.e. the data is likely to be predicted well by models that predicted well before? Curiously, fitting such "sparse composite models" is achieved by mixing in a bit of all the past posteriors. This self-referential updating method is rather peculiar, but it is efficient and gives superior performance on many natural data sets. It is also important because it introduces a long-term memory: any model that has done well in the past can be recovered quickly. While Bayesian interpretations can be found for mixing in a bit of the initial prior, no Bayesian interpretation is known for mixing in past posteriors.
We build atop the "specialist" framework from the online learning literature to give the Mixing Past Posteriors update a proper Bayesian foundation. We apply our method to a well-studied multitask learning problem and obtain a new intriguing efficient update that achieves a significantly better bound.
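The update described above can be sketched numerically. This is an editorial illustration, not the authors' exact algorithm: the mixing rate `alpha` and the uniform average over all past posteriors are assumed choices made for simplicity.

```python
import numpy as np

def bayes_step(weights, likelihoods):
    """One round of Bayes' rule: reweight models by how well they predicted."""
    posterior = weights * likelihoods
    return posterior / posterior.sum()

def mix_past_posteriors(weights, past_posteriors, alpha=0.01):
    """Mix a small fraction alpha of the average of all past posteriors
    into the current weights. This gives long-term memory: a model that
    did well in an earlier segment keeps non-negligible weight and can
    be recovered quickly when it becomes good again."""
    memory = np.mean(past_posteriors, axis=0)
    return (1 - alpha) * weights + alpha * memory

# Three models; model 0 predicts well early and late, model 2 in between.
w = np.ones(3) / 3
history = [w.copy()]
for t in range(30):
    good = 0 if (t < 10 or t >= 20) else 2
    lik = np.full(3, 0.2)
    lik[good] = 0.8
    w = mix_past_posteriors(bayes_step(w, lik), history)
    history.append(w.copy())
# w concentrates on model 0 again after the middle segment
```

Mixing with the uniform initial prior instead of `history` would recover the classical weak-restart scheme that the abstract contrasts against.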
- Fares Hedayati, Peter L. Bartlett: The optimality of Jeffreys prior for online density estimation and the asymptotic normality of Maximum Likelihood estimators [abstract]
We study online learning under logarithmic loss with regular parametric models. We show that a Bayesian strategy predicts optimally only if it uses Jeffreys' prior. This result was known for canonical exponential families; we extend it to parametric models for which the maximum likelihood estimator is asymptotically normal. The optimal prediction strategy, normalized maximum likelihood, depends on the number n of rounds of the game, in general. However, when a Bayesian strategy is optimal, normalized maximum likelihood becomes independent of n. Our proof uses this to exploit the asymptotics of normalized maximum likelihood. The asymptotic normality of the maximum likelihood estimator is responsible for the necessity of Jeffreys' prior.
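The closeness of a Jeffreys-prior Bayesian strategy to normalized maximum likelihood (NML) can be checked numerically for the Bernoulli model, the simplest case covered by the earlier exponential-family result. This sketch is editorial: it compares per-sequence regrets at a fixed horizon n rather than following the paper's argument.

```python
import math
from math import comb, lgamma

def log_ml(k, n):
    """Log maximum-likelihood probability of a Bernoulli sequence
    with k ones out of n outcomes."""
    if k in (0, n):
        return 0.0
    return k * math.log(k / n) + (n - k) * math.log(1 - k / n)

def nml_regret(n):
    """NML regret: the log normalizer (sum of ML probabilities over all
    sequences); by construction it is identical for every sequence."""
    return math.log(sum(comb(n, k) * math.exp(log_ml(k, n))
                        for k in range(n + 1)))

def jeffreys_regret(k, n):
    """Regret of the Jeffreys (Beta(1/2,1/2)) mixture on one sequence
    with k ones: log ML probability minus log mixture probability."""
    log_mix = (lgamma(k + 0.5) + lgamma(n - k + 0.5) + lgamma(1.0)
               - lgamma(0.5) - lgamma(0.5) - lgamma(n + 1.0))
    return log_ml(k, n) - log_mix

n = 100
r_nml = nml_regret(n)
r_jef = [jeffreys_regret(k, n) for k in range(n + 1)]
# The Jeffreys-mixture regret is nearly flat away from the boundary
# and close to the sequence-independent NML regret.
```

Since NML is the minimax-optimal strategy, the worst-case Jeffreys regret can never fall below the NML regret; the numerical gap at interior sequences is what shrinks as n grows.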
- Manfred K. Warmuth: From Relative Entropies to Bregman Divergences, to the Design of Convex and Tempered Non-convex Losses [abstract]
Bregman divergences are a natural generalization of the relative entropy and the squared Euclidean distance. They are used as regularizers in machine learning and in the design of loss functions. For a single neuron with an increasing transfer function, the "matching loss" is a certain integral of the transfer function, which is always convex since it is a Bregman divergence. For example, if the transfer function is the identity function then the matching loss becomes the square loss, and for the sigmoid transfer function the matching loss is the logistic loss.
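The two special cases can be checked numerically. This sketch is editorial: it uses the standard integral form of the matching loss (the integral of f(z) - y for z between f^-1(y) and f^-1(y_hat), where f is the increasing transfer function) and simple trapezoidal quadrature.

```python
import math

def matching_loss(f, f_inv, y_hat, y, n=10_000):
    """Matching loss for an increasing transfer function f: the integral
    of (f(z) - y) for z from f_inv(y) to f_inv(y_hat), computed here by
    trapezoidal quadrature."""
    a, b = f_inv(y), f_inv(y_hat)
    h = (b - a) / n
    total = 0.5 * ((f(a) - y) + (f(b) - y))
    for i in range(1, n):
        total += f(a + i * h) - y
    return total * h

# Identity transfer: the matching loss is half the square loss.
identity = lambda z: z
sq = matching_loss(identity, identity, y_hat=0.9, y=0.2)  # (0.9-0.2)**2 / 2

# Sigmoid transfer: the matching loss is the logistic (relative entropy) loss.
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
logit = lambda p: math.log(p / (1.0 - p))
log_loss = matching_loss(sigmoid, logit, y_hat=0.9, y=0.2)
```

For the sigmoid, the integral evaluates to the binary relative entropy between Bernoulli(y) and Bernoulli(y_hat), which is one way to see the connection back to relative entropies.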
Convex losses are convenient to use. However they are inherently non-robust to outliers. By bending down the "wings" of a convex loss we get non-convex losses. An example of such a "mismatched loss" occurs when the square loss is used for a sigmoided neuron. This loss is non-convex in the weight vector and can lead to exponentially many local minima. Even though mismatched losses are troublesome they are needed because they are robust to outliers.
We introduce a temperature into the logistic loss by starting with the matching loss corresponding to the t-entropy. By letting the temperatures of the label and the estimator veer apart, we gradually introduce non-convexity into the matching loss function. In experiments we achieve robustness to outliers while still avoiding local minima.
Joint work with Nan Ding and S. V. N. Vishwanathan.
- Wouter Koolen, Wojciech Kotłowski, Manfred K. Warmuth: Learning eigenvectors for free [abstract]
We extend the classical problem of predicting (coding) a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products u u^T, where u is a vector in R^n of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n^2 parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequences of dyads share a common eigensystem, i.e. the worst-case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret.
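The matrix setting can be made concrete with a small editorial sketch. The Laplace-style estimator below (the classical counts-plus-one rule generalized to sum-of-dyads-plus-identity, normalized to unit trace) and the per-dyad log loss -log(u^T W u) are assumed forms chosen for illustration, not necessarily the algorithms analyzed in the talk.

```python
import numpy as np

def laplace_density_matrix(dyads, n):
    """Matrix analogue of the Laplace (add-one) estimator: classical
    'counts + 1 per symbol' becomes 'sum of dyads + identity',
    normalized to trace one so the result is a density matrix."""
    S = np.eye(n) + sum(dyads)
    return S / np.trace(S)

def loss(W, u):
    """Log loss of density matrix W on the outcome dyad u u^T."""
    return -np.log(u @ W @ u)

# A stream of unit-length outcome vectors (their dyads are u u^T).
rng = np.random.default_rng(42)
n = 3
us = [v / np.linalg.norm(v) for v in rng.standard_normal((20, n))]
W = laplace_density_matrix([np.outer(u, u) for u in us], n)
# W is symmetric, positive definite, and has trace 1
```

When every dyad is a standard basis vector e_i e_i^T, the estimator reduces exactly to the classical Laplace rule on a multinomial alphabet, which is the correspondence the abstract exploits.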
10:50-12:10 Session 8 (chair: Fares Hedayati)
12:10-13:20 Lunch
- Jun'ichi Takeuchi, Yoshinari Takeishi: Sparse Superposition Codes with Discrete Dictionary [abstract]
We study an extension of the sparse superposition codes proposed by Barron and Joseph in 2010. The original sparse superposition codes are designed using a dictionary of codewords consisting of real-valued vectors. The codes are proved to achieve rates up to the capacity of the Gaussian channel when the entries of the dictionary are independently drawn from a normal distribution [Barron and Joseph 2010]. We prove that the codes achieve rates up to capacity even if the entries of the dictionary are independent equiprobable ±A (ignoring computational complexity). This proposition was conjectured in [Barron and Joseph 2011].
- Kazuho Watanabe, Shiro Ikeda: Convex Formulation for Nonparametric Estimation of Mixing Distribution [abstract]
We discuss a nonparametric estimation method for the mixing distribution in mixture models. We propose an objective function with one parameter, whose minimization reduces to maximum likelihood estimation or to kernel vector quantization in special cases. Generalizing Lindsay's theorem for nonparametric maximum likelihood estimation, we prove the existence and discreteness of the optimal mixing distribution and devise an algorithm to calculate it. Furthermore, we connect this unifying estimation framework to the rate-distortion problem. It is demonstrated that with an appropriate choice of the parameter, the proposed method is less prone to overfitting than the maximum likelihood method.
- Guoqiang Zhang, Richard Heusdens: Efficient Message-Passing for Distributed Quadratic Optimization [abstract]
Distributed quadratic optimization (DQO) has found many applications in computer science and engineering. In designing a message-passing algorithm for DQO, the basic idea is to associate the quadratic function under consideration with a graphical model. The nodes in the graph iteratively exchange local information, in the form of messages, with their neighbours until the global optimal solution is reached. The efficiency of a message-passing algorithm depends on its computational complexity, the number of parameters to be transmitted, and its convergence speed. In this work, we compare the performance of several message-passing algorithms. In particular, we consider the Jacobi-relaxation algorithm, the generalized linear coordinate-descent (GLiCD) algorithm, and the min-sum-min algorithm.
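Of the algorithms compared, Jacobi relaxation is the simplest to sketch. The example below is editorial: it minimizes the quadratic 0.5 x^T A x - b^T x (equivalently, solves A x = b) by coordinate-wise updates, which converge when A is diagonally dominant.

```python
import numpy as np

def jacobi(A, b, iters=200):
    """Jacobi relaxation for min_x 0.5 x^T A x - b^T x, i.e. A x = b.
    Each node i repeatedly re-solves its own coordinate using only its
    neighbours' current values, which is why the scheme distributes
    over the graph of nonzero entries of A."""
    x = np.zeros_like(b)
    D = np.diag(A)              # diagonal entries
    R = A - np.diagflat(D)      # off-diagonal (neighbour) part
    for _ in range(iters):
        x = (b - R @ x) / D
    return x

# A diagonally dominant A guarantees convergence of the iteration.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
x = jacobi(A, b)
```

The per-iteration message from node j to node i is just A_ij x_j, so each node transmits one number per neighbour per round; the more elaborate algorithms in the talk trade extra message parameters for faster convergence.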
- So Hirai, Kenji Yamanishi: Clustering Change Detection Using Normalized Maximum Likelihood Coding [abstract]
We are concerned with the issue of detecting changes of clustering structures from multivariate time series. From the viewpoint of the minimum description length (MDL) principle, we introduce an algorithm that tracks changes of clustering structures so that the sum of the code-length for the data and that for clustering changes is minimized. Here we employ a Gaussian mixture model (GMM) as the representation of clustering, and compute the code-length for data sequences using normalized maximum likelihood (NML) coding. The introduced algorithm enables us to deal with clustering dynamics, including merging, splitting, emergence, and disappearance of clusters, from the unifying view of the MDL principle. We empirically demonstrate using artificial data sets that our proposed method is able to detect cluster changes significantly more accurately than an existing statistical-test-based method and AIC/BIC-based methods.
Organization
Honorary chairs:
- Jorma Rissanen, Tampere University of Technology, and HIIT
- Kenji Yamanishi, University of Tokyo
Organizing committee:
- Steven de Rooij, Centrum Wiskunde & Informatica, Amsterdam
- Wojciech Kotłowski, Poznań University of Technology
- Teemu Roos, University of Helsinki
- Petri Myllymäki, University of Helsinki
Past WITMSEs
The Workshop on Information Theoretic Methods in Science and Engineering (WITMSE) is an annual workshop that brings together researchers interested in all aspects of information theory and its applications. WITMSE 2012 is the fifth workshop. Previous meetings in the series were held in Helsinki and Tampere.
Registration
The registration fee (both standard and student) covers the sessions, lunches, coffee breaks, and evening programme. The registration fees are as follows:
 | Early (before or at 12 Aug) | Late (from 13 Aug) |
---|---|---|
Student | €220 | €290 |
Non-student | €270 | €340 |
Workshop dinner: We can guarantee that participants with an early registration can attend the banquet dinner. Registration to the banquet is done in conjunction with the workshop registration. Please ask for additional tickets by sending us e-mail.
Students: Proof of Student ID will be required at the workshop on-site check-in.
Accommodation
As many of the nicer Amsterdam hotels have limited availability, rather than recommend individual hotels, we suggest finding a place to stay using booking.com. Select "Amsterdam City Centre, Netherlands" in the search box and fill in the appropriate dates. Then, after selecting your desired filters, turn on the map and look for a highlighted hotel that provides a good tradeoff between price, review score and distance to the workshop venue. It is easy to do so by comparing the map on the booking website to the one embedded below:
Proceedings
Proceedings of the Fifth Workshop on Information Theoretic Methods in Science and Engineering (WITMSE 2012), edited by Steven de Rooij, Wojciech Kotłowski, Jorma Rissanen, Petri Myllymäki, Teemu Roos, and Kenji Yamanishi.
- Persistent URL: http://oai.cwi.nl/oai/asset/20683/20683B.pdf
- ISBN: 978-90-6196-563-3
- BibTeX entry
Instructions for Submitting (Extended) Abstracts
(Please note that this is not an open call for submissions; participation in the workshop is by invitation only.) Each speaker is encouraged to submit a PDF file containing an extended abstract describing the topic of their talk by the deadline of 12 August, 2012. The length of the abstract should be one to four (1–4) pages. The abstract should be formatted according to the WITMSE guidelines; see below.
In the body of your email, please list all the authors as they should appear in the author index (use Latex code), and indicate where in the alphabetical order each name belongs (e.g., do you want "van Gogh" to be listed under "V" or "G"?).
Send the pdf file to s.de.rooij@cwi.nl.
Please make sure that the pages are not numbered (we will add the page numbers later when we merge all the files together).
The submitted paper should preferably be in PDF format or, if that is not possible, in PS format. When submitting a PDF file, all fonts should be embedded and subset, and the compatibility level should be no greater than 1.5 (viewable with Acrobat 6.0 and later). The paper size should be A4.
When creating with LaTeX:
- LaTeX example: witmse_guidelines.tex
- Figure example: witmse_demofig.eps
- Style file: witmse.sty
- Bibtex example: witmse_guidelines.bib
- Bibtex style: IEEEbib.bst
- Created PS file: witmse_guidelines.ps
- Created PDF file: witmse_guidelines.pdf
When creating with Microsoft Word:
- Word example: witmse_guidelines.doc
- Word template: witmse.dot
Instructions for presenting your paper
Each paper is allocated a 20-minute slot in the technical programme. Note: please do not plan your talk to last the full 20 minutes; leave some time for questions and for switching speakers.
For the talks, we will set up a master laptop in the lecture room. You are expected to upload your presentation to the laptop during one of the breaks before your session. The laptop will support PDF and PowerPoint files, and we are working to also make LibreOffice/OpenOffice files possible; if you use the latter, check here later for confirmation. If you must use your own laptop (e.g. if you have a Keynote presentation or a demo that requires certain software), this will be possible but is not recommended. In such cases it is especially important to check that everything is in order before the session.