Course final projects

The course has now concluded and the grades are entered. The final activity of the semester was the final project presentation day, when 16 student teams presented the results of their 10-week final projects. The breadth of topics was impressive, and was made possible in part by more than 15 class partners and collaborators who contributed and supported data-driven challenges for the students.

Each final project was unique. Driven by the importance of the applied context, the teams worked hard to make their projects stand out. In the end, the final project presentations showcased valuable work.

For example, the Deloitte team built a tablet app for interactively displaying and exploring high-accident road segments in the US. The Harvard-MIT team created a studio-quality interactive visualization exploring factors that contribute to purchases of carbon offsets for an Australian airline. The Diffeo team beat last year’s winners of the DARPA challenge on algorithmically scoring the relevance of 9 TB of text documents. The Starbucks team came up with new models that beat standard operations research methods in predictive accuracy when forecasting shipment occurrence and timing. The Harvard Stats team discovered that homophily plays a key role in the formation of the international financial network, which might have international policy implications. And these are just some of the projects.

I am excited for the students in the class who will go on to do great data-driven work – they are surely well-equipped for it.

Harvard Stat 221 “Statistical Computing and Visualization”: all lectures online

Stat 221 is Statistical Computing and Visualization. It’s a graduate class on analyzing data without losing scientific rigor, and on communicating your work. Topics span the full cycle of a data-driven project, including project setup, design, implementation, and creating interactive user experiences to communicate ideas and results. We covered current theory and philosophy of building models for data, computational methods, and tools such as D3.js, parallel computing with MPI, and R.

All lecture slides are now available online:

  • Lecture 1, Course Introduction
  • Lecture 2, Introduction to Visualization, Modeling, and Computing (VMC)
  • Lecture 3, Intro VMC – Modeling and Computing
  • Lecture 4 – Guest Lecture by Rachel Schutt, Introduction to Data Science
  • Lecture 5, A More Rigorous Look at Visualization
  • Lecture 6, Statistical Models and Likelihood
  • Lecture 7, Likelihood Principle, MLE Foundations, Odyssey
  • Lecture 8, Stochastic Optimization for Inference, Odyssey
  • Lecture 9, Modeling with Missing Data/Latent Variables
  • Lecture 10, Expectation-Maximization Algorithm (EM)
  • Lecture 11, EM for HMMs, Properties of EM
  • Lecture 12, EM variants, Data Augmentation
  • Lecture 13, Likelihood + Prior = Posterior (Bayesian Inference)
  • Lecture 14, Missing Data and MCMC
  • Lecture 15, Hamiltonian Monte Carlo (HMC)
  • Lecture 16, Decision Theory and Statistical Inference
  • Lecture 17, Parallel Statistical Computing
  • Lecture 18, Parallel Tempering
  • Lecture 19, Message Passing Interface (MPI) for Parallel Tempering
  • Lecture 20, Equi-Energy MCMC Sampler
  • Lecture 21, Approximate Methods: Variational Inference
  • Lecture 22, Variational EM, Monte Carlo EM
  • Lecture 23, Hacker Level: Data Augmentation
  • Lecture 24, Interactive Experiences and Us
  • Lecture 25, The Final Lecture: Summing It Up

I feel privileged to have been invited onto this journey together with the students. Together, we learned substantial theory, created interactive visualizations, defined open problems in current research, structured our thinking about interactive user experiences, and are now finishing work on the course final projects with a roster of first-class course partners.

While the lectures are over, the journey of learning and new discoveries in data-driven projects doesn’t stop here. If anything, it’s only getting more interesting.

Critique of an interactive visualization: “Finding love on Craigslist”

This is a T-shirt competition entry by Lo-Hua Yuan. Lo-Hua won the comp this time and is getting a T-shirt, congratulations!

If you’re looking for love, check out Craigslist. I’ve never used Craigslist before, but after seeing Dorothy Gambrell’s graphic, I was curious whether or not the source of its data is still functioning. Indeed, under Craigslist’s personals, there’s a section for “missed connections” that lets people post missed chances at love, with the far-out hope that their potential new love(r) will see the obscure blurb. The posts usually begin with something like, “I saw you the other day at that one place… you were…” Call me maybe?

Critique of an interactive visualization: “Wealth inequality in America”

This is a winning T-shirt comp submission, by Sidd Viswanathan. Congratulations Sidd!

I present here a critique and explanation of the famous “wealth inequality” data visualization, a video that has caught many people’s attention lately. The link is here, and I encourage all to watch:

wealth-inequality

Critique of an interactive visualization: “World’s food consumption”

This is a T-shirt competition entry by Yiqun Zhao. Yiqun won a T-shirt this time, congratulations!

I found a visualization of the world’s food consumption. This is a static visualization done in 2011. It includes data for the 20 highest-consuming and the 20 lowest-consuming countries. The data points are located on a world map. When one moves the mouse over a data point, the country name, calories, and income spent on food appear. Two charts, Calories consumed and Income spent on food, are shown at the bottom of the visualization webpage. This data visualization has certain advantages and disadvantages, discussed as follows.

Stat 221 T-shirts have arrived!

The class has 6 T-shirt competitions (5 finished and the final one pending) and code performance competitions based on homeworks. All winners get free T-shirts. Today, the T-shirts have arrived and the hardworking competition winners can claim theirs!

The course graphic design mastermind Sofia Hou designed two versions, and we ordered a nice variety for everyone.

There are in fact two versions of the front logos, and the lucky competition winners can pick up a T-shirt with their favorite design.

The teaching staff says an all-caps THANK YOU both to the competition winners and to those who weren’t yet lucky enough to get a free T-shirt. Students sometimes worked 50+ hours on their homework submissions to design better code and make it run faster for the code performance leaderboards, or came up with creative T-shirt competition entries.

In addition to the T-shirts already awarded, there are more chances to win one! We will be awarding two more T-shirts, one for comp 6 and one for the pset 5 code performance leaderboard. Subject to size availability, we also plan to award T-shirts to all members of the team with the best final project presentation, as determined by the audience!

We look forward to the 15+ final project presentations of work done together with the likes of Starbucks, Nationwide Insurance, eBay, Deloitte, MIT, and others.

We learned a lot in the class, and the T-shirt comps helped. We learned how to write better code, how to understand interactive visualization, and how to be comfortable with data-intensive problems without giving up scientific rigor. Most importantly, we learned where to go next.

Critique of an interactive visualization: “How many households have a savings account?”

This is a T-shirt competition submission by Allen Schmaltz.

I recently came across an intriguing visualization on the website associated with the book Poor Economics by Abhijit V. Banerjee and Esther Duflo.

The visualization presents summary statistics regarding poor households in 17 countries and two administrative units within India. The user can readily navigate a variety of thematic areas (including, among others, Education, Health, and Entrepreneurship) via the tabs at the top of the visualization. Within each thematic area, the user can select a specific set of summary statistics via a drop-down menu. The subsets are presented as questions, such as “How many households have any kind of insurance?”. Within each subset, the user chooses the income per day of the subset (in the range from $1/day to $6-10/day).

Clicking on the colored circles on the map displays a popup with a short synopsis of an applicable article and a link to the full article. Mousing over a particular area on the world map highlights the associated bar in the graph and provides a tooltip with the name of the area and its population. Clicking on the gray circles on the map, or the bars or flags in the chart, displays summary statistics for that particular area. The transitions when switching between all of these various views are smooth and fluid.

Critique of an interactive visualization: “One day of Wiki edits”

This is a T-shirt competition entry by Kevin Oh. Kevin was the second T-shirt winner for competition 5, congratulations!

This interactive data visualization, released by Wikimedia, is intended to summarize and show the usage of Wikipedia around the globe. Consisting of five pages, the visualization dedicates the first three to displaying the frequency of wiki page edits in different parts of the world, followed by two pages of view summaries. The interactivity of this visualization is somewhat interesting, but not very effective. Unlike most recent interactive visualizations, users can communicate with it only through keyboard commands. For instance, one flips between pages by pressing a number key between 1 and 5, and presses other keys to change the displayed language or the view. Though using the keyboard as the interface could save some display area by removing all the buttons, it is certainly not as convenient or intuitive as a GUI.

The goal of this visualization is also not very clear. It roughly summarizes how Wikipedia pages are used across the world, but what the audience, supposedly Wikipedia users, can gain from this information is not clear. The most prevalent goal of this kind of visualization is marketing: encouraging more people to start using Wikipedia and businesses to pay more for advertisements. However, Wikipedia is known to refrain from any outside advertising. Thus, as marketing is not a plausible explanation, the goal of this visualization is at best unknown.

Critique of an interactive visualization: “The last three months on Foursquare”

This is a T-shirt competition entry by Jay Baxter. Jay won the T-shirt this time, congratulations!

Foursquare, a location-based social networking app for mobile devices that allows users to publicly post their location (“check-in”), made an interactive visualization of the last 500 million check-ins. It is an information summary aimed at showing off Foursquare’s popularity around the globe while simultaneously providing an engaging and informative user experience.

The user is first presented with a black map of the entire world with a white dot for every check-in. It is quite beautiful and stunning: it looks very similar to satellite pictures at nighttime, where only densely populated regions are lit up. The user is allowed to zoom and scroll around the map, search for a city to zoom there, and toggle satellite imagery and map labels. The visualization is particularly fun if you zoom in all the way on the area where you live to see which areas of stores are the most popular.

Hamiltonian Monte Carlo, Parallel Tempering, MPI: lectures 14-19 of Stat 221

Stat 221 is over 75% done! In the earlier lectures, we learned how to estimate the parameters of complex models with point estimation. Lectures 14-19 are mostly about the more computationally intensive full MCMC sampling methods, such as Hamiltonian Monte Carlo and Parallel Tempering. We also discussed how Decision Theory treats making decisions based on system parameters and the uncertainty about them, and how to parallelize computing algorithms with the Message Passing Interface.

I am very excited for the students who are now in the middle of the implementation stage for their final projects, which in addition to academic projects include industry partnerships with companies like eBay, Starbucks, Nationwide Insurance, Deloitte, and many others. Can’t wait for the final project presentations at the beginning of May! Based on the most recent team status updates on their final projects, we will see some great work there.