julia 0.3.6: Analyzing the Old Faithful data

Mar 6, 2015 · 1170 words · 6 minutes read julia

Every time I check-in, it seems that Julia has taken another small step forward. When I first learned of Julia I encountered a lot of issues that prevented me from performing even simple data tasks. Now at version 0.3.6 I’ve decided to make another attempt at using Julia for a mini-analysis. The purpose of this is to see how Julia might grow as a contender against other languages (e.g. R, Python) for performing statistical analyses.

During my first There is a GitHub page dedicated to statistics and machine learning in Julia. Noteably the DataFrames library adds important functionality for performing statistical analyses including representation of missing data.

Old faithful

Since the purpose of this mini-project is to casually see what we can do with Julia, I thought I’d use a trustworthy and dependable dataset: the Old Faithful Geyser data.

Old Faithful is a well-known geyser at Yellowstone National Park. The geyser has a fairly regular cycle, erupting every 35 to 120 minutes for $$1 \frac{1}{2}$$ to 5 minutes at a time. While there are several different versions of this data, they all contain at least the eruption duration time in minutes and the waiting interval between eruptions. This particular version is a sample taken during August 1978 and August and includes the day and a categorization of the duration as “long” (greater than 3 minutes) or “short” (less than 3 minutes).

The very basics

The DataFrames library is already installed, but if you need to install it you can use Pkg.add('DataFrames'). Loading the data is then pretty easy, just use readtable(). By default Julia will set the current working directory to the same current directory of your bash shell or whatever shell you used to launch Julia. If you launch Julia from the application icon it defaults to your home directory (homedir()). After loading the data, Julia spits out the head and tail of the dataset.

julia> using DataFrames
julia> old_faithful = readtable("/Users/ellisvalentiner/Desktop/old-faithful/Old-Faithful.csv")
107x5 DataFrame
| Row | x   | day | interval | duration | dur_cat |
|-----|-----|-----|----------|----------|---------|
| 1   | 1   | 1   | 78       | 4.4      | "L"     |
| 2   | 2   | 1   | 74       | 3.9      | "L"     |
| 3   | 3   | 1   | 68       | 4.0      | "L"     |
| 4   | 4   | 1   | 76       | 4.0      | "L"     |
| 5   | 5   | 1   | 80       | 3.5      | "L"     |
| 6   | 6   | 1   | 84       | 4.1      | "L"     |
| 7   | 7   | 1   | 50       | 2.3      | "S"     |
| 8   | 8   | 1   | 93       | 4.7      | "L"     |
| 9   | 9   | 1   | 55       | 1.7      | "S"     |
| 10  | 10  | 1   | 76       | 4.9      | "L"     |
| 11  | 11  | 1   | 58       | 1.7      | "S"     |
⋮
| 96  | 96  | 8   | 73       | 4.4      | "L"     |
| 97  | 97  | 8   | 70       | 4.1      | "L"     |
| 98  | 98  | 8   | 88       | 4.1      | "L"     |
| 99  | 99  | 8   | 75       | 4.0      | "L"     |
| 100 | 100 | 8   | 83       | 4.1      | "L"     |
| 101 | 101 | 8   | 61       | 2.3      | "S"     |
| 102 | 102 | 8   | 78       | 4.6      | "L"     |
| 103 | 103 | 8   | 61       | 1.9      | "S"     |
| 104 | 104 | 8   | 81       | 4.5      | "L"     |
| 105 | 105 | 8   | 51       | 2.0      | "S"     |
| 106 | 106 | 8   | 80       | 4.8      | "L"     |
| 107 | 107 | 8   | 79       | 4.1      | "L"     |

Further we can easily summarize the variables in the data with the describe() function – similar to the summary() function in R or the describe() method in Python using the pandas library.

julia> describe(old_faithful)
x
Min      1.0
1st Qu.  27.5
Median   54.0
Mean     54.0
3rd Qu.  80.5
Max      107.0
NAs      0
NA%      0.0%

day
Min      1.0
1st Qu.  3.0
Median   5.0
Mean     4.514018691588785
3rd Qu.  6.0
Max      8.0
NAs      0
NA%      0.0%

interval
Min      42.0
1st Qu.  59.0
Median   75.0
Mean     71.0
3rd Qu.  80.5
Max      95.0
NAs      0
NA%      0.0%

duration
Min      1.7
1st Qu.  2.3
Median   3.8
Mean     3.457943925233645
3rd Qu.  4.3
Max      4.9
NAs      0
NA%      0.0%

dur_cat
Length  107
Type    UTF8String
NAs     0
NA%     0.0%
Unique  2

I like that describe() returns the count and percent of values that are NA. Unlike R, readtable() does not automatically convert strings to a factor-type but instead leaves them as strings. Since I usually begin an R session with options(stringsAsFactors=FALSE) this is a very nice change. In Julia we want to convert these from a string to a PooledDataArray.

julia> pool!(old_faithful, [:dur_cat])

The pool! function takes a dataframe and array of columns to convert to the PooledDataArray type. In Julia column referencing is done using :column_name. If we had multiple columns to convert to factors we can simply comma separate them in the array.

Data visualization

Creating plots in Julia is done using the Gadfly library. It is similar to Hadley Wickham’s ggplot2 in R (and also inspired by Leland Wilkinson’s The Grammar of Graphics). Perhaps one of the coolest things about Gadfly is built-in support for scaleable vector graphics (SVG) with embedded javascript. This format enables mild interaction with the plot (e.g. zooming, panning, etc).

julia> using Gadfly
julia> draw(SVGJS("old_faithful_scatter.js.svg", 6inch, 4inch),
    plot(old_faithful, x="duration", y="interval", color="dur_cat", Geom.point,
    Guide.XLabel("Duration"), Guide.YLabel("Interval"), Guide.colorkey("Duration category")))
julia> draw(SVGJS("old_faithful_boxplot.js.svg", 6inch, 4inch),
	plot(old_faithful, x="dur_cat", y="interval", color="dur_cat", Geom.boxplot, Guide.XLabel("Duration"), Guide.YLabel("Interval"), Guide.colorkey("Duration category")))

Click and drag on the images for full effect.

Simple linear regression

At last we come to regression. The syntax in the GLM package is similar to R in that it accepts a formula but requires that you specify the distribution family for the errors and the response link.

julia> using GLM
julia> fit = glm(interval ~ duration, old_faithful, Normal(), IdentityLink())
DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal,IdentityLink},DensePredChol{Float64}},Float64}:

Coefficients:
             Estimate Std.Error z value Pr(>|z|)
(Intercept)   34.0046   2.25038 15.1106   <1e-50
duration      10.6987  0.623421 17.1612   <1e-65

Take aways

Julia has come a long way since I last played around with it, around version 0.3.0 or so. It feels like real language. The syntax is familiar to both R and Python, although I think there is a stronger Python “feel”.

The biggest pain point was that loading packages is slow. I mean really slow. This is a known issue and it looks like package loading times will be improved in future releases. Otherwise the only problems I had were related to learning the language – not with the language itself.

I’m looking forward to watching how Julia continues to progress. Today I only scratched the surface of the type of work I would like to use Julia for. There is still a lot of learning to do before integrating Julia into my work full-time.