Here is a sample of the programming projects that I've worked on.
Experiments to test the hypothesis that it is extremely easy to select a subset of covariates to obtain a desired level of significance. This work was part of my master's thesis and a publication in The American Statistician, The Perils of Balance Testing in Experimental Design: Messy Analyses of Clean Data.
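To make the hypothesis concrete, here is a minimal sketch of such a specification search. It is not the code from the thesis or the paper: the simulated data, the function names, and the normal approximation to the t-test are all assumptions chosen to keep the example in pure Python. The outcome is generated with no treatment effect, and the treatment coefficient is then re-estimated by ordinary least squares under every subset of covariates.

```python
import itertools
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def treatment_p_value(y, treat, covariates):
    """Two-sided p-value for the treatment coefficient in an OLS fit.

    Uses a normal approximation to the t distribution (an assumption
    made here to avoid needing SciPy).
    """
    n = len(y)
    # Design matrix columns: intercept, treatment, then the chosen covariates.
    X = [[1.0, treat[i]] + [c[i] for c in covariates] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    resid = [y[i] - sum(bj * xj for bj, xj in zip(beta, X[i])) for i in range(n)]
    sigma2 = sum(r * r for r in resid) / (n - k)
    # Diagonal entry of (X'X)^{-1} for the treatment column (index 1).
    unit = [0.0] * k
    unit[1] = 1.0
    variance = sigma2 * solve(xtx, unit)[1]
    z = beta[1] / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2.0))


def specification_search(n=100, n_covariates=6, seed=0):
    """Return the treatment p-value for every subset of covariates."""
    rng = random.Random(seed)
    treat = [float(rng.random() < 0.5) for _ in range(n)]
    covs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_covariates)]
    # The outcome depends on the covariates but NOT on treatment.
    y = [0.5 * sum(c[i] for c in covs) + rng.gauss(0, 1) for i in range(n)]
    p_values = {}
    for size in range(n_covariates + 1):
        for subset in itertools.combinations(range(n_covariates), size):
            chosen = [covs[j] for j in subset]
            p_values[subset] = treatment_p_value(y, treat, chosen)
    return p_values


if __name__ == "__main__":
    p = specification_search()
    print("no covariates:", p[()])
    print("best subset:  ", min(p.values()))
```

Because the search includes the empty subset, the smallest p-value found can never be larger than the unadjusted one. Repeating the search over many seeds gives a rough sense of how often some subset crosses a chosen significance threshold.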
In an effort to learn more about machine learning, I've decided to go through the textbook Machine Learning: A Probabilistic Perspective by Kevin Murphy. I also want to learn more Python, so I've decided to write up solutions to selected exercises in Jupyter notebooks.
In general, I try to derive the mathematical results in the notebooks, too, so that the code is not completely opaque. When it comes to modeling, at times I will use an existing scikit-learn implementation, but often, I will try to implement the model myself with NumPy and SciPy when I feel that doing so is instructive. Plotting is done with matplotlib.
For my undergraduate thesis, I extended a mathematical model of the loop of Henle. Fluid flow and a feedback loop were modeled with partial differential equations. In particular, we related pressure and flow with the Hagen-Poiseuille equation, and modeled active transport with Michaelis-Menten kinetics. To solve the equations, I wrote a numerical solver in C. These C routines were called from MATLAB. Data visualization was done in MATLAB, too.
We found that ion concentration depended principally on transit time through the loop and that, when pressure was perturbed, the loop acted as a low-pass filter for ion concentration.
The code can be found on GitHub.
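For readers unfamiliar with the two relations named in the loop of Henle model, here is a small sketch of their standard textbook forms. It is not the thesis solver: the function names and parameters are illustrative assumptions, and the Hagen-Poiseuille form shown is the one for steady laminar flow in a rigid tube, which may differ from the form used in the actual model.

```python
import math


def poiseuille_flow(delta_p, radius, length, viscosity):
    """Volumetric flow rate Q = pi * r^4 * dP / (8 * mu * L).

    Standard Hagen-Poiseuille relation for steady laminar flow
    through a rigid cylindrical tube.
    """
    return math.pi * radius ** 4 * delta_p / (8.0 * viscosity * length)


def michaelis_menten_rate(concentration, v_max, k_m):
    """Transport rate v = V_max * C / (K_m + C).

    The rate saturates at v_max for large concentrations and equals
    v_max / 2 when the concentration equals k_m.
    """
    return v_max * concentration / (k_m + concentration)
```

Two properties are worth noticing: flow scales with the fourth power of the radius, so doubling the radius multiplies the flow by 16, and active transport saturates, so it cannot keep up with arbitrarily high concentrations.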
Snapstream Searcher is a search engine for closed-captioning television scripts. It features a domain-specific language for queries of arbitrary complexity and the ability to correlate a large number of terms by their proximity to each other.
I developed it as a research tool for Professors Robin Pemantle and Diana Mutz at the University of Pennsylvania.
Try the tool at Snapstream Searcher.
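As an illustration of what correlating terms by proximity can mean, here is a minimal sketch that counts how often pairs of terms appear within a given number of words of each other. It does not reproduce the tool's query language or its implementation. The tokenizer, the function names, and the distance measure (difference in word positions) are assumptions made for this example.

```python
import re
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def proximity_counts(text, terms, max_distance):
    """Count co-occurrences of each pair of terms within max_distance words."""
    tokens = tokenize(text)
    positions = {
        term: [i for i, tok in enumerate(tokens) if tok == term.lower()]
        for term in terms
    }
    counts = {}
    for a, b in combinations(terms, 2):
        counts[(a, b)] = sum(
            1
            for i in positions[a]
            for j in positions[b]
            if abs(i - j) <= max_distance
        )
    return counts


if __name__ == "__main__":
    transcript = "The senator discussed the economy and later the election results"
    print(proximity_counts(transcript, ["senator", "economy", "election"], 4))
```

This brute-force pairing is quadratic in the number of occurrences per pair. A real search engine would more likely work from a positional index.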
Markdown Ace Editor is the textarea editor on this site, which is used for comments, posts, and biographies. As you may be able to tell, it's heavily influenced by the editor at Mathematics Stack Exchange. Although the two look similar, under the hood I've replaced the Markdown converter with marked and the textarea editor with Ace Editor for the Emacs keybindings.
Infection was an assignment from when I was interviewing for a data scientist position at Khan Academy. The idea behind it is that their product is used in the classroom, so instead of A/B testing individual users, they A/B test groups of users. Thus, we can view students and coaches as a directed graph, where we draw an edge from each student to a coach. We call the groups of users that are testing the new feature infected. The goal is to optimally assign users to the infection group, hence the name.
You can see a visualization of this process here.
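One simple way to picture the student-coach graph in code is to infect a whole connected group at once, so that nobody in a classroom sees a different version from their coach. The sketch below does this with a breadth-first search. It is an assumption about one possible strategy, not the submitted solution, and the function name and edge format are hypothetical.

```python
from collections import defaultdict, deque


def infect_component(edges, start):
    """Return every user connected to `start` through coaching relations.

    `edges` is an iterable of (student, coach) pairs. Direction is
    ignored when spreading, so coaches and their students are always
    infected together.
    """
    neighbors = defaultdict(set)
    for student, coach in edges:
        neighbors[student].add(coach)
        neighbors[coach].add(student)

    infected = {start}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        for other in neighbors[user]:
            if other not in infected:
                infected.add(other)
                queue.append(other)
    return infected


if __name__ == "__main__":
    edges = [("s1", "c1"), ("s2", "c1"), ("s3", "c2")]
    print(sorted(infect_component(edges, "s1")))
```

Choosing which components to infect so that the infected group reaches a target size is where the optimization comes in; this sketch only covers the spreading step.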
SAS to CSV is a C-based tool that converts SAS7BDAT files to CSV files. It's not very interesting, but it has a lot of utility, and it's my most starred project on GitHub. I learned a lot about C and bit operations doing this project. The inspiration for this project came when a client sent us a large SAS7BDAT file that I wanted to read into R.
SAS7BDAT is a proprietary binary file format, which other programs can't read, whereas any program can read a CSV file. Currently, it only works on SAS7BDAT files generated on specific systems. Eventually, I plan on updating it to have some 64-bit support.
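To give a flavor of the kind of work involved in turning fixed-width binary records into CSV, here is a small sketch in Python rather than C. The record layout is entirely hypothetical and is NOT the SAS7BDAT layout: it assumes each record is a little-endian 32-bit integer, a 64-bit float, and one flag byte whose lowest bit marks a missing value.

```python
import csv
import io
import struct

# Hypothetical layout: int32 id, float64 value, uint8 flags, little-endian.
# A real format would need different layouts for different platforms.
RECORD = struct.Struct("<idB")


def records_to_csv(blob):
    """Decode fixed-width binary records and return them as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "value", "is_missing"])
    for offset in range(0, len(blob) - RECORD.size + 1, RECORD.size):
        rec_id, value, flags = RECORD.unpack_from(blob, offset)
        is_missing = bool(flags & 0x01)  # lowest bit: value is missing
        writer.writerow([rec_id, "" if is_missing else value, int(is_missing)])
    return out.getvalue()


if __name__ == "__main__":
    blob = RECORD.pack(1, 2.5, 0) + RECORD.pack(2, 0.0, 1)
    print(records_to_csv(blob), end="")
```

The explicit `<` in the format string fixes the byte order and disables padding. Field widths and byte order that vary with the system that wrote the file are one reason a converter may work only on files from specific platforms.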