October 30 turned out to be the day of the “Future of Statistics”, with a talk I attended by Andrew Gelman in the early afternoon, as well as the Future of Statistics Unconference later in the day. A short recap of what I took away from these meetings.
Talk by Andrew Gelman
Prof. Gelman discussed the current crisis of non-replicability of results in science and how Bayesian methods may help to sort out some of the issues (most of them seemed like behavioural issues to me, regardless of the methods you use):
- The difference between “significant” and “not significant” is not itself statistically significant
- Flat priors give inferences we can’t believe
- There is a difference between research hypotheses and statistical “hypotheses”
- The statistical significance filter
- Researcher degrees of freedom
- The garden of forking paths (researcher degrees of freedom even if no explicit p-values are considered)
- The “That which does not destroy my statistical significance makes it stronger” fallacy
- Quest for certainty
- Sign (Type S) and magnitude (Type M) errors
- Interactions are important but difficult to estimate with precision
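The garden of forking paths is easy to demonstrate in a small simulation (the setup and numbers below are my own illustration, not Gelman's): even when no effect exists anywhere, a researcher who reports the best-looking of several plausible analyses will "find" significance far more often than the nominal 5%.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims, n_forks = 50, 2000, 5  # 5 plausible analyses per dataset

false_positives = 0
for _ in range(n_sims):
    # Null world: none of the analyses has a true effect.
    data = rng.normal(size=(n_forks, n))
    # One t-statistic per "fork"; the researcher reports whichever
    # analysis happens to look most significant.
    t = data.mean(axis=1) / (data.std(axis=1, ddof=1) / np.sqrt(n))
    if np.abs(t).max() > 2.01:  # two-sided 5% cutoff for a single test
        false_positives += 1

# With 5 forks the realized rate is roughly 1 - 0.95**5, not 0.05.
print(f"nominal alpha: 0.05, realized: {false_positives / n_sims:.3f}")
```

Note that no single fork is "p-hacked" here; the inflation comes purely from the freedom to choose among analyses after seeing the data.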
His proposed solutions:
- Open science (data, publication of successful and unsuccessful studies, open discourse, replication)
- Retrospective design calculation
- Informative priors
- Hierarchical models for interactions
Being open about assumptions helps, especially when combined with the ability to reproduce results from the original data. Much of this comes down to a realistic attitude in which most effects are small and few results are shocking; that is, after all, how the world works.
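The retrospective design calculation he advocates can be sketched in a few lines; here is a simulation-based version in the spirit of Gelman and Carlin's Type S/Type M analysis (the effect size and standard error below are made-up numbers for illustration):

```python
import numpy as np

def retrodesign(true_effect, se, z_crit=1.96, n_sims=100_000, seed=1):
    """Simulation version of a retrospective design calculation:
    given a plausible true effect and the study's standard error,
    estimate power, the Type S (wrong sign) rate, and the Type M
    (exaggeration) factor among significant estimates."""
    rng = np.random.default_rng(seed)
    est = rng.normal(true_effect, se, n_sims)  # hypothetical replications
    sig = np.abs(est) > z_crit * se            # "statistically significant"
    power = sig.mean()
    type_s = (np.sign(est[sig]) != np.sign(true_effect)).mean()
    type_m = np.abs(est[sig]).mean() / abs(true_effect)
    return power, type_s, type_m

# A small true effect measured noisily: the rare significant estimates
# must be large, so they exaggerate the effect and sometimes flip sign.
power, type_s, type_m = retrodesign(true_effect=0.1, se=0.3)
print(f"power {power:.2f}, Type S {type_s:.2f}, Type M {type_m:.1f}")
```

In this underpowered regime, conditioning on significance guarantees a severalfold exaggeration of the effect, which is exactly the statistical significance filter from the list above.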
Future of Statistics Unconference
Hadley Wickham: Statistical Software
He started by introducing a model of how we do data analysis: Munge/Visualize/Model.
Cognition vs. computation:
- Small data: cognition time ≫ computation time
- Big data: computation time ≫ cognition time
Can we have both in statistical software?
He identified two main trends for the future of statistical software:
- Everything is moving to the web, and interactivity is going to be everywhere
- Open research/source
Daniela Witten: Big Data Inference
Prof. Witten argued that one of the big areas of statistics is going to be ‘Statistical Learning for Big Data’. Whereas we have gotten very good at prediction, we lack methods for inference. We are not satisfied with the black boxes that a good prediction mechanism may offer; to understand the underlying processes we need tools for inference. Inference for machine learning is hard. She referred to recent work on obtaining p-values for variables in LASSO regression, which seems quite topical given the recent high-profile publication by Tibshirani.
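The selective-inference machinery for the lasso is beyond a blog post, but the underlying problem is easy to simulate. Below is a toy sketch (entirely my own illustration, using simple correlation screening as a stand-in for lasso selection): naive p-values computed after picking the best-looking predictor are badly anti-conservative, while data splitting restores validity.

```python
import numpy as np

def corr_test(x, y, crit=1.98):
    """Two-sided test of zero correlation via the t approximation."""
    n = len(y)
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r * r)
    return abs(t) > crit

rng = np.random.default_rng(2)
n, p, n_sims = 100, 20, 500
naive = split = 0
for _ in range(n_sims):
    X = rng.normal(size=(n, p))  # pure noise: no predictor matters
    y = rng.normal(size=n)
    # Naive: select the best-looking predictor and test it on the
    # SAME data (a stand-in for selecting variables with the lasso).
    j = np.argmax(np.abs(np.corrcoef(X.T, y)[-1, :-1]))
    naive += corr_test(X[:, j], y)
    # Honest: select on one half of the data, test on the other.
    h = n // 2
    j = np.argmax(np.abs(np.corrcoef(X[:h].T, y[:h])[-1, :-1]))
    split += corr_test(X[h:, j], y[h:])

print(f"naive rejection rate: {naive / n_sims:.2f}")  # far above 0.05
print(f"split rejection rate: {split / n_sims:.2f}")  # near 0.05
```

The recent selective-inference work aims to get valid p-values without paying the sample-splitting price, by conditioning on the selection event itself.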
Joe Blitzstein: Statistics Education
Prof. Blitzstein talked about the future of statistics education. Specifically, he made remarks aimed at undergraduate-level and graduate-level statistics courses.
Undergraduate level:
- Probability education: discuss why some distributions are famous, use less calculus, emphasize thinking about conditioning
- More context, greater integration with computer science
- More practical work: real, interesting datasets
Graduate level:
- Role of measure theory: do a better job of integrating it into the curriculum
- Emphasize that we can be both Bayesians and frequentists
Hongkai Ji: Connecting datasets
Prof. Ji discussed the future of statistics in biology. Real scientific problems motivate the development of statistical methods. He identified the following trends:
- High-throughput technologies -> new methods
- Vast amounts of publicly available data
Example: transcription factor binding to genes (many different cell types and genes). In cross-data-type prediction, both the predictors and the outputs are high dimensional. We are moving from hypothesis-driven to data-driven science, with more heterogeneous data.
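To make the "both predictors and outputs high dimensional" point concrete, here is a toy sketch (entirely my own illustration, not Prof. Ji's method): multi-output ridge regression fits many outputs at once, e.g. binding signals across cell types, with a single regularized linear solve.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 200, 500, 50  # more predictors than samples, many outputs

# Simulated stand-in for, e.g., genomic features -> binding signals:
# a sparse true coefficient matrix plus noise.
W_true = rng.normal(size=(p, q)) * (rng.random((p, q)) < 0.02)
X = rng.normal(size=(n, p))
Y = X @ W_true + 0.5 * rng.normal(size=(n, q))

# Ridge regression handles all q outputs in one regularized solve:
# W_hat = (X'X + lam*I)^{-1} X'Y
lam = 10.0
W_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# In-sample fit; a real analysis would evaluate on held-out cell types.
resid = Y - X @ W_hat
r2 = 1 - (resid ** 2).sum() / ((Y - Y.mean(axis=0)) ** 2).sum()
print(f"in-sample R^2 across all outputs: {r2:.2f}")
```

The regularization is what makes the p > n regime workable at all; without it, X'X would be singular.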
Roles of statistics:
- Safeguards against wrong claims
- Engine for discovery
This offers opportunities to delve into real scientific problems.
Sinan Aral: Big Experiments
More important than Big Data: granularity and precise experimental control to get at causal inference. Why causal inference matters: it tells you what happens if you intervene or change variables.
Opportunities of big experiments:
- Get a handle on treatment effect heterogeneity (leads to personalized policies)
- Not just what the outcome of an intervention will be, but also why; this informs new interventions
- Subtle but economically significant effects
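Treatment effect heterogeneity is exactly where sample size pays off. A toy simulation (my own numbers, not from the talk) shows how a big experiment separates subgroup effects that the overall average obscures:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000  # a "big experiment": subgroup effects become estimable

# Hypothetical population: the treatment helps group A (effect 2.0)
# and does nothing for group B (effect 0.0).
group_a = rng.random(n) < 0.5
treated = rng.random(n) < 0.5            # randomized assignment
effect = np.where(group_a, 2.0, 0.0)
y = effect * treated + rng.normal(size=n)

def ate(mask):
    """Difference-in-means treatment effect within a subgroup."""
    return y[mask & treated].mean() - y[mask & ~treated].mean()

print(f"overall ATE: {ate(np.ones(n, bool)):.2f}")  # ~1.0, hides the story
print(f"group A ATE: {ate(group_a):.2f}")           # ~2.0
print(f"group B ATE: {ate(~group_a):.2f}")          # ~0.0
```

A policy based on the overall average would treat everyone; the subgroup estimates support the personalized policy of treating only group A.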
Challenges of big experiments:
- Data are not i.i.d. (human datasets)
- Sampling, selection bias
- Interference: treatments spill over to non-treated and other treated subjects
- Design strategies
- Inference strategies
Hilary Mason
Statistics and business: statistics leads to understanding, and big data gives the opportunity to play with data. A data scientist combines math, software engineering, and telling stories. What do they deliver:
- Business intelligence (real-time)
- Product
- Research into opportunities
What it is like: the most interesting data science opportunities are in startups, because of their speed.
Tools: tools that non-statisticians can use, and that can be applied at different scales.