Jesse Krijthe
http://www.jessekrijthe.com/
Recent content on Jesse Krijthe. Author: jkrijthe@gmail.com (Jesse Krijthe)

Peirce in the Garden
http://www.jessekrijthe.com/articles/peirce-in-the-garden/
Fri, 31 Aug 2018 20:00:00 +0000
<p>After <a href="http://www.jessekrijthe.com/articles/hacking-inductive-logic/">reading Ian Hacking’s book</a> I got interested in C.S. Peirce’s ideas on inductive inference. I had not heard of Peirce before. Another recent interest is Deborah Mayo’s error statistical philosophy. Luckily, Mayo turns out to be an admirer of Peirce and wrote a paper connecting her arguments about so-called severe testing to Peirce’s ideas on self-correcting science.</p>
<p>As I was not familiar with Peirce’s self-correcting thesis (SCT) and its criticisms, I can’t really comment on the validity of the nicely explained connection that Mayo makes. However, the following passage from Peirce stood out to me (emphasis mine):</p>
<blockquote>
<p>This account of the rationale of induction is distinguished from others in that it has as its consequences two rules of inductive inference which are very frequently violated namely, that the sample be (approximately) random and that the <strong>property being tested not be determined by the particular sample</strong> — i.e., predesignation.</p>
</blockquote>
<p>As Mayo notes below this passage, “when the null hypothesis is tested on the same data that led it to be chosen for testing, it is well known, a spurious impression of a genuine effect easily results”. The way Peirce formulates it, however, reminded me of Gelman & Loken’s garden of forking paths paper. Of course, Gelman & Loken’s claim is more subtle. They argue that during an analysis, the data at hand may inform the choices that are made, leading to one analysis path among many potential ones. Had the data looked slightly different, we might have made slightly different choices, leading to a different analysis.</p>
<p>Peirce’s comment points out that in these cases the rationale for induction breaks down. Only under particular assumptions do the conclusions make sense. At the very least, in Mayo’s terminology, the tests become less ‘severe’. I find it interesting that Peirce noted this important point before Neyman-Pearson testing was even invented, let alone as commonly applied and (mis-)used as it is today, before we strayed into the garden of forking paths. The passage also reinforces the idea that it takes some thinking to connect the numbers your favourite statistical programme calculates for you with what we have actually learned from the data.</p>
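<p>The effect Mayo and Peirce warn about is easy to demonstrate. Below is a minimal simulation (my own illustration, not from Mayo or Peirce) in which every candidate outcome is pure noise, but the hypothesis to test is chosen by looking at the data first; the sample sizes, threshold, and names are arbitrary choices for the sketch:</p>

```python
import random
import statistics

# Illustrative sketch: under a true null, choose which of several candidate
# outcomes to test based on the same data, and see how often a "significant"
# effect appears. All numbers here are arbitrary choices for the example.
random.seed(0)

def z_stat(sample):
    # z-statistic for a mean-zero null with known unit variance
    return statistics.fmean(sample) * len(sample) ** 0.5

n_sims, n_outcomes, n_obs = 2000, 5, 30
crit = 1.96  # nominal two-sided 5% threshold for a single predesignated test
false_positives = 0
for _ in range(n_sims):
    # Every outcome is pure noise, so any detected "effect" is spurious.
    zs = [z_stat([random.gauss(0, 1) for _ in range(n_obs)])
          for _ in range(n_outcomes)]
    # The forking path: the data decide which hypothesis gets tested.
    if max(abs(z) for z in zs) > crit:
        false_positives += 1

print(f"nominal rate: 0.05, realized rate: {false_positives / n_sims:.3f}")
```

<p>With five data-dependent candidate hypotheses, the realized false positive rate is roughly 1 − 0.95<sup>5</sup> ≈ 0.23 rather than the nominal 0.05, which is exactly the failure of predesignation Peirce describes.</p>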
<h3 id="references">References</h3>
<blockquote>
<p>Mayo, D. G. (2005). Peircean Induction and the Error-Correcting Thesis. Transactions of the Charles S. Peirce Society, 41(2), 299–319. <a href="https://doi.org/10.1353/csp.2011.0039">Link</a></p>
<p>Mayo, D. G., & Spanos, A. (2011). Error Statistics. Philosophy of Statistics, 7, 153–198. <a href="https://doi.org/10.1016/B978-0-444-51862-0.50005-8">Link</a></p>
<p>Gelman, A., &amp; Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. <a href="https://doi.org/10.1037/a0037714">Link</a></p>
</blockquote>
PhD Defense
http://www.jessekrijthe.com/articles/phd-defense/
Tue, 16 Jan 2018 00:00:00 +0000
<p><img src="http://www.jessekrijthe.com/img/thesis-stack.jpg" width="400" class="sidenote" /></p>
<p>I defended my PhD thesis! The thesis “Robust Semi-supervised Learning: Projections, Limits & Constraints” is about exploring the limits of the guarantees one can give for whether a semi-supervised learner will outperform its supervised counterpart. In other words: the limits of the usefulness of additional unlabeled data in a supervised learning setting.</p>
<p>While it seems sensible that you would want such guarantees before using semi-supervised methods to incorporate unlabeled data into the learning process, we show that without additional assumptions, for many supervised learners, semi-supervised alternatives with strict non-degradation guarantees cannot be constructed. Perhaps surprisingly, however, for some supervised learners/loss functions, semi-supervised methods can be constructed that do give strict non-degradation guarantees. For details of how these concepts (guarantees, performance) were defined, please see the thesis itself. After thinking about this topic for such a long time, I’m quite proud of how the thesis turned out. You can find it <a href="http://www.jessekrijthe.com/thesis.pdf">here</a>.</p>
<p>(As an aside: the thesis is typeset using R’s knitr package and Latex. You can find the source code to generate figures and text <a href="https://github.com/jkrijthe/RobustSSL">here</a>)</p>
<p>As part of the thesis defense, at most Dutch universities you are expected to include a set of propositions: some pertaining to the core claims made in the thesis, some about your field of research, and some about science and society in general. These are the propositions I included in my thesis:</p>
<ol>
<li>Semi-supervised learning without additional assumptions, beyond those inherent in the supervised classifier, is possible.</li>
<li>One can guarantee performance improvement of some semi-supervised learners over their supervised counterparts.</li>
<li>Truly safe semi-supervised learning is impossible for a large class of commonly used classifiers.</li>
<li>Considering a classification method’s performance in terms of the actual loss it minimizes at train time gives useful insights.</li>
<li>There is a limit to the usefulness of asymptotic results.</li>
<li>Rather than hoping for practice to better correspond to current statistical methods, we need new methods that better match the adaptive way statistics is used in practice.</li>
<li>The focus in statistical practice on hypothesis testing is feeding society’s appetite for clear-cut answers in a reality where none are available.</li>
<li>Data is uninteresting without a model, while a model can be interesting without data.</li>
<li>Publishers have become a dispensable part of scientific communication.</li>
<li>Our unwillingness or inability to define our actual goals, combined with a need for certainty, leads to surrogate measures (e.g. GDP, H-index, wealth, ‘likes’ on social media) that are actively harmful.</li>
</ol>
<p>I again want to thank my opposition committee: Ludmila Kuncheva, Peter Grunwald, Jelle Goeman, Erik van den Akker and Tom Heskes for the insightful questions that made for a super enjoyable discussion.</p>
Hacking's Introduction to Probability and Inductive Logic
http://www.jessekrijthe.com/articles/hacking-inductive-logic/
Wed, 06 Sep 2017 20:00:00 +0000
<p><img src="http://www.jessekrijthe.com/img/hacking-inductive-logic.jpg" width="400" class="sidenote" /></p>
<p>I recently read through Ian Hacking’s “Introduction to Probability and Inductive Logic”. I was interested in learning more about the philosophical basis of different ways of reasoning from experiences/past data. Having come across this book, I was hoping to find some discussion of these ideas, which are addressed in the final part of the book.</p>
<p>The book starts by drawing the connections and differences between deductive logic (risk free: the conclusion is true if the premises are true) and inductive reasoning (not risk free: following the argument may lead to a wrong conclusion even if the premises are correct). An interesting analogy drawn is that in inductive reasoning:</p>
<blockquote>
<p>Criticizing the models is like challenging the premises. Criticizing the analysis of the model is like challenging the reasoning.</p>
</blockquote>
<p>The second part of the book offers a basic and well-written introduction to probability theory. Not much new here, except that I learned that the first axiomatization of probability was by Christiaan Huygens in 1657. Part three discusses decision theory, where I liked the use of paradoxes to present various points.</p>
<p>In part four the dichotomy of belief-type and frequency-type probability is introduced. Both types are then covered separately. When talking about frequency type probability and p-values, Hacking clearly differentiates between p-values and Neyman-Pearson hypothesis testing. Referring to Fisher, Hacking mentions:</p>
<blockquote>
<p>Above all, he thought that statistics and frequency-type probabilities were an essential tool for the communication of information among scientists working in fields where there may not be very good theories about nature, but where there is a great deal of data which needs to be understood.</p>
</blockquote>
<p>I thought this was interesting since, on the one hand, it seems to make sense, but on the other hand, it seems this is exactly what leads to replication problems caused by multiple testing and adaptive data analysis in fields that cannot rely on “very good theories”.</p>
<p>Hacking then mentions Neyman’s argument against regarding confidence intervals and hypothesis testing as <em>inductive inference</em>, seeing them only as corresponding to <em>inductive behaviour</em>: sticking to such methods is a policy that is correct x percent of the time, but says nothing about the current inference. Hacking explains that Neyman’s strong views on this point make sense in the historical context. The then new concept of the confidence interval was easily mistaken for a probability statement about an individual event occurring. There was also the danger that contemporaneous ideas like fiducial inference may have led people to believe confidence intervals have a similar interpretation.</p>
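<p>Neyman’s behavioural reading is perhaps easiest to see in a simulation. The sketch below (my own illustration, with arbitrary parameter choices) repeatedly constructs a known-variance 95% confidence interval: the procedure covers the true mean in about 95% of repetitions, while any single interval simply does or does not contain it:</p>

```python
import random
import statistics

# Illustrative sketch of "inductive behaviour": the 95% describes the
# long-run performance of the interval-producing procedure, not any
# single interval. All parameters below are arbitrary choices.
random.seed(1)

true_mean, sigma, n_obs, n_reps = 10.0, 2.0, 25, 4000
covered = 0
for _ in range(n_reps):
    sample = [random.gauss(true_mean, sigma) for _ in range(n_obs)]
    centre = statistics.fmean(sample)
    half_width = 1.96 * sigma / n_obs ** 0.5  # known-variance z-interval
    if centre - half_width <= true_mean <= centre + half_width:
        covered += 1

print(f"long-run coverage: {covered / n_reps:.3f}")
```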
<h2 id="problem-of-induction">Problem of Induction</h2>
<p>This discussion about inductive inference vs. inductive behaviour leads into the final part of the book. Personally, I found this to be by far the most interesting part. Here Hume’s problem of induction is introduced: one cannot prove that the future will be related to the past without falling back on circular reasoning. To evade this problem, Popper argues for using only conjectures, deductive reasoning and refutation instead of aiming for direct induction, and for recognizing the fallibility of all current knowledge.</p>
<p>Similarly, to evade the problem of induction, arguments can be made for both the belief-type framework and the frequency-type framework. In the belief-type framework, the claim is not that Bayesian probability updating leads to a valid inductive inference, but that the way we update the beliefs is rational. Hacking mentions this argument is not complete though: we still need to show that beliefs that are consistent at each point in time are also consistent when we move to the next time point (after seeing evidence). In other words, one needs to be “true to one’s former self”, which is, as Hacking notes, a type of moral argument. This moralism comes back in the frequentist evasion of the problem of induction as well.</p>
<p>In the frequency-type framework, one evasion of the problem of induction is the one offered by Neyman and discussed above: only considering inductive behaviour and not inductive inference. Given the assumptions in our model, by using a particular strategy of making decisions, we will be correct most of the time. One issue is of course that in my particular situation, this long-run behaviour does not help me directly. I was especially interested in Peirce’s arguments on dealing with this. Essentially, because the number of decisions in our life is always finite, it is hard to argue for long-run probabilities unless we consider the collective behaviour of the whole community, leading Peirce to the morally sounding claim that “Logic is rooted in the social principle”. To contrast this with Hume’s argument, Hacking concludes:</p>
<blockquote>
<p>Hume thought in a hugely individualistic way, of how <em>my</em> beliefs are formed by <em>my</em> experiences. Peirce thought in a collective and communal way, in terms of how <em>our</em> beliefs are formed on the basis of <em>our</em> inquiries.</p>
</blockquote>
<h2 id="theory-of-applied-induction">Theory of Applied Induction</h2>
<p>The book indeed offers an introduction to inductive logic. Overall, I think the book is worth checking out, particularly the first and final chapters (1-2, 20-22). What I would have loved to see is examples that are closer to those encountered in actual data analysis, and a discussion of how these relate to the principles of (evading) induction discussed in the book.</p>
<p>In chapter 18, for instance, when discussing significance testing, there is no real discussion of multiple testing. While it is mentioned that a significant result is just one step of the scientific process, it is suggested that this result should then inform new directions of research. But without considering multiple testing and adaptive data analysis, such a strategy might lead us astray. What would therefore be great is the addition of some chapters that bridge the gap between the idealized examples and the messiness of the combination of adaptive and confirmatory data analysis we often deal with in the real world. A discussion of these issues and the role that frameworks like the <a href="https://errorstatistics.com">error statistical viewpoint</a> and <a href="http://www.stat.columbia.edu/~gelman/research/published/philosophy.pdf">hypothetico-deductive Bayesianism</a> might play here would be very interesting, but is perhaps beyond the scope of an introduction.</p>
<p>On a related note, I, like Hacking and many others, aim to be statistically eclectic: use the method and inferential framework that best fits your question of interest. Unfortunately, while that sounds nice as a statement of intent, I have seen few people offer much advice on how to actually match methods to research goals.</p>
<p>The book was an interesting gateway into thinking more about such problems.</p>
Eurovision Winning Probabilities
http://www.jessekrijthe.com/articles/eurovision-betting-prob/
Tue, 09 May 2017 17:09:13 +0000
<p>This week marks the <a href="https://en.wikipedia.org/wiki/Eurovision_Song_Contest_2017">62nd edition of the Eurovision Song Contest</a>: an annual event where countries from across Europe and beyond (Australia is competing as well) come together to perform 3 minute pop songs.</p>
<p><a href="https://www.kaggle.com/c/Eurovision2010">Predicting the outcome</a> of the contest poses an interesting statistical problem: the rules of the competition have been relatively stable over the years, so there is some data to base predictions on, yet there is only a single contest every year, making it easy to overtrain a model on the limited data available.</p>
<p>Perhaps the most commonly reported predictions by the media are those implied by the odds set by bookmakers. In this note, I want to explore what probabilities the bookmakers’ odds correspond to for this year’s competition, as well as how well these probabilities predicted the winner in recent years.</p>
<p>Let’s start with the most important part, this year’s probabilities:</p>
<p><img src="http://www.jessekrijthe.com/articles/eurovision-betting-prob_files/figure-html/current-odds-1.png" width="576" /></p>
<p>Italy is the clear favourite. As we’ll see below, the markets are relatively confident in this year’s favourite actually winning, compared to the previous four years.</p>
<p>A short note on the methodology: I converted the decimal odds reported by the bookmakers to probabilities and then divided these by the total probability assigned by a bookie to all the countries combined. This total probability is bigger than one, which reflects the advantage the bookmakers have over their customers. The division by this total is a crude way to correct for this advantage. The probabilities reported here correspond to the median probability over all the bookmakers I had data for.</p>
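<p>In case the description above is unclear, this is roughly the computation, sketched in Python with made-up numbers (the country names and odds are hypothetical, not the actual 2017 data):</p>

```python
import statistics

def implied_probabilities(decimal_odds):
    """Convert one bookmaker's decimal odds to normalized probabilities.

    The raw implied probabilities 1/odds sum to more than one (the
    bookmaker's advantage); dividing by that total is the crude
    correction described in the text.
    """
    raw = {country: 1.0 / odds for country, odds in decimal_odds.items()}
    total = sum(raw.values())  # bigger than one for a full market
    return {country: p / total for country, p in raw.items()}

# Hypothetical odds from two bookmakers (not the real data).
bookies = [
    {"Italy": 1.5, "Portugal": 4.0, "Bulgaria": 6.0},
    {"Italy": 1.6, "Portugal": 3.5, "Bulgaria": 7.0},
]
per_bookie = [implied_probabilities(b) for b in bookies]
# Median probability per country across bookmakers, as in the post.
median_probs = {c: statistics.median(p[c] for p in per_bookie)
                for c in per_bookie[0]}
print(median_probs)
```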
<p>These are the probabilities on (or close to) the Monday before the competition in previous years:</p>
<p><img src="http://www.jessekrijthe.com/articles/eurovision-betting-prob_files/figure-html/previous-odds-1.png" width="350px" /><img src="http://www.jessekrijthe.com/articles/eurovision-betting-prob_files/figure-html/previous-odds-2.png" width="350px" /><img src="http://www.jessekrijthe.com/articles/eurovision-betting-prob_files/figure-html/previous-odds-3.png" width="350px" /><img src="http://www.jessekrijthe.com/articles/eurovision-betting-prob_files/figure-html/previous-odds-4.png" width="350px" /></p>
<p>In 2014 the bookmakers were clearly off, although it is hard to say whether this is bad, given the sample of only 4 years. In the other three years, the winner was assigned a reasonably high probability. It is interesting how skewed towards Italy the probabilities are this year. In previous years, there was usually a second country with a reasonably high probability. Whether this reflects a clear preference by the European voters, we’ll have to see during the final this Saturday.</p>
<p><em>Update after the contest:</em> Portugal won, which, given the large probability placed on an Italy win by the bookmakers, does not help increase confidence in the hypothesis that these probabilities are properly calibrated.</p>
Favourite Work at ICML 2015
http://www.jessekrijthe.com/articles/icml2015/
Wed, 05 Aug 2015 13:09:13 +0000
<p>This post is just to remind myself of some of my favourite posters/presentations that I saw while attending ICML. I have undoubtedly missed a lot of interesting stuff. If you have any particular suggestions, please let me know!</p>
<p><a href="http://jmlr.org/proceedings/papers/v37/betancourt15.pdf">The Fundamental Incompatibility of Scalable Hamiltonian Monte Carlo and Naive Data Subsampling</a><br/>
<em>Michael Betancourt</em><br/>
I liked the topic and the kind of analysis and I especially liked his clear style of presentation. Moreover, there was quite a lively discussion about whether this incompatibility is actually a problem, or whether it focussed too much on only the bias that is introduced by naive subsampling.</p>
<p><a href="http://jmlr.org/proceedings/papers/v37/salimans15.pdf">Markov Chain Monte Carlo and Variational Inference: Bridging the Gap</a><br/>
<em>Tim Salimans, Diederik Kingma, Max Welling</em><br/>
The presentation and poster were a bit hard for me to follow but the problem seems important.</p>
<p><a href="http://jmlr.org/proceedings/papers/v37/lopez-paz15.pdf">Towards a Learning Theory of Cause-Effect Inference</a><br/>
<em>David Lopez-Paz, Krikamol Muandet, Bernhard Schölkopf, Iliya Tolstikhin</em><br/>
Interesting use of Maximum Mean Discrepancy in a clear analysis of an important problem.</p>
<p><a href="http://jmlr.org/proceedings/papers/v37/blundell15.pdf">Weight Uncertainty in Neural Network</a><br/>
<em>Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra</em><br/>
I have not looked into how exactly their approach is different from previous attempts at incorporating weight uncertainty, but the updates for the weight parameters seemed surprisingly simple.</p>
<p><a href="http://jmlr.org/proceedings/papers/v37/ramaswamy15.pdf">Convex Calibrated Surrogates for Hierarchical Classification</a><br/>
<em>Harish Ramaswamy, Ambuj Tewari, Shivani Agarwal</em><br/>
I like this idea of classification calibrated losses and this seems like an interesting extension to hierarchical loss functions.</p>
<p><a href="http://jmlr.org/proceedings/papers/v37/narasimhana15.pdf">Optimizing Non-decomposable Performance Measures: A Tale of Two Classes</a><br/>
<em>Harikrishna Narasimhan, Purushottam Kar, Prateek Jain</em><br/>
The authors consider functions of the true positive rate and true negative rate and come up with two classes of such functions and an approach to maximize them. The one class includes measures like the G-mean and the H-mean, while the other class includes the F-measure and Jaccard coefficient.</p>
<p><a href="http://jmlr.org/proceedings/papers/v37/jiao15.pdf">The Kendall and Mallows Kernels for Permutations</a><br/>
<em>Yunlong Jiao & Jean-Philippe Vert</em><br/>
The authors consider the problem of learning from permutations or rankings instead of vector of real valued numbers. In particular, they construct PSD kernels based on Kendall’s coefficient and Mallows kernel in order to apply kernel methods to the problem.</p>
<p><a href="http://arxiv.org/pdf/1501.05427v3">Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the Unbiased LInear System SolvEr (ULISSE)</a><br/>
<em>Maurizio Filippone & Raphael Engler</em><br/>
This seems to tackle the important problem exact quantification of uncertainty in covariance parameters for gaussian processes with seemingly few constraints on the number type of covariance function.</p>
<p><a href="http://jmlr.org/proceedings/papers/v37/hugginsb15.pdf">Risk and Regret of Hierarchical Bayesian Learners</a><br/>
<em>Jonathan H. Huggins & Joshua B. Tenenbaum</em><br/>
Again, an interesting analysis of an important problem, although it will take me some more time to study the actual result.</p>