Rental Listings Part 1 - Data Import/Processing

Mar 24, 2017 · 773 words · 4 minutes read · R

Kaggle is a great place to find data. Last month (Feb. 2017) a competition was posted that caught my eye: Two Sigma Connect: Rental Listing Inquiries. The purpose of the competition was to predict the popularity of apartment rental listings.

I’m not particularly familiar with either Two Sigma or RentHop, but I saw that the data included lat/lon coordinates, and I enjoy working with spatial data. While the competition goal was to predict rental listing popularity, I’m more interested in exploring the spatial properties of the dataset.

Data Processing

The first step is bringing in the data. It’s JSON, so I used jsonlite, but the data comes in as nested lists. purrr has some good tools for reorganizing the data. I transposed the lists, which turns the list of columns into a list of records; flattened each record to remove one level of the inner list hierarchy; and coerced the records into a tibble using bind_rows. Finally, I used type_convert to apply tidyverse column type specifications to the data. Note: both jsonlite and purrr have functions named flatten, so watch your namespaces (the code below calls purrr::flatten explicitly).

library(tidyverse)  # loads dplyr, purrr, readr, tidyr, etc.

raw <- jsonlite::fromJSON("2017-03-24-renthop_data/train.json",
                          simplifyVector = FALSE) %>%
    purrr::transpose() %>%
    map(purrr::flatten) %>%  # purrr::flatten, not jsonlite::flatten
    bind_rows() %>%
    type_convert()

Taking a quick look at the data structure, I noticed a column with an empty name. This column was created because some listings contained multiple photos. I decided to drop the photo columns because a) I’m not interested in them, and b) the listing images are available in a separate file and organized by listing ID, so this information could be recovered by traversing the listing images directory. (A sketch of the drop follows the structure listing below.)

## Classes 'tbl_df', 'tbl' and 'data.frame':    49352 obs. of  16 variables:
##  $ bathrooms      : num  1 1 1 1.5 1 1 2 1 ...
##  $ bedrooms       : int  1 2 2 3 0 3 3 0 ...
##  $ building_id    : chr  "8579a0b0d54db803821a35a"| __truncated__ "b8e75"..
##  $ created        : POSIXct, format: "2016-06-16 05:55:27" "2016-06-01 0"..
##  $ description    : chr  "Spacious 1 Bedroom 1 Ba"| __truncated__ "BRAND"..
##  $ display_address: chr  "145 Borinquen Place" "East 44th" "East 56th St"..
##  $                : chr  "https://photos.renthop."| __truncated__ "https"..
##  $ latitude       : num  40.7 40.8 40.8 40.7 ...
##  $ listing_id     : int  7170325 7092344 7158677 7211212 7225292 7226687 ..
##  $ longitude      : num  -74 -74 -74 -73.9 ...
##  $ manager_id     : chr  "a10db4590843d78c784171a"| __truncated__ "955db"..
##  $ price          : int  2400 3800 3495 3000 2795 7200 6000 1945 ...
##  $ street_address : chr  "145 Borinquen Place" "230 East 44th" "405 East"..
##  $ interest_level : chr  "medium" "low" "medium" ...
##  $ features       : chr  NA NA NA ...
##  $ photos         : chr  NA NA NA ...
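The unnamed column holds the photo URLs that the flatten step spilled out of each record. Here’s a minimal sketch of the drop, assuming the tidyverse is loaded; the exact select call is my reconstruction, since the post doesn’t show this step:

# drop the photos column and the unnamed photo-URL column, looking the
# latter up by position rather than hard-coding it
raw <- raw %>%
    select(-which(names(.) == ""), -photos)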

Neighborhood Tabulation Areas

Neighborhood Tabulation Areas (NTAs) are aggregations of NYC Census tracts. New York City is densely populated, which makes its individual Census units geographically small. I decided to use NTAs because they’re reasonably small (smaller than boroughs) but not so small that many of them would contain no rental listings. Arguably, you could choose some other areal unit.

It was a bit easier to read in the New York City Neighborhood Tabulation Areas GeoJSON file. For some reason, I had some issues getting blogdown/Hugo to play nice with GeoJSON. Fortunately, GeoJSON is just JSON, so renaming the file with a .json extension fixed the issue.

nyc <- geojsonio::geojson_read("2017-03-24-renthop_data/nyc_nta.json", what="sp")
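Before overlaying points on these polygons, it’s worth confirming that the NTA layer is in plain lon/lat WGS84 to match the listing coordinates. A quick check of my own (not in the original post):

# should report a longlat / WGS84 proj4 string
sp::proj4string(nyc)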

Next I created spatial points, intersected them with the NYC NTA boundaries, and added the NTA data (e.g. borough name, county FIPS code, etc.) as columns of the RentHop data. With the data in tidy form, I computed neighborhood-level variables (e.g. median price).

library(sp)

proj4_string <- CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")

tidy <- raw %>%
    select(longitude, latitude) %>%
    SpatialPoints(proj4string = proj4_string) %>%
    over(nyc) %>%  # point-in-polygon overlay: one row of NTA attributes per listing
    bind_cols(raw) %>%
    as_data_frame()

nta_stats <- tidy %>%
    group_by(OBJECTID) %>%
    summarize(
        price = median(price, na.rm = TRUE),
        bathrooms = median(bathrooms, na.rm = TRUE),
        bedrooms = median(bedrooms, na.rm = TRUE),
        num_listings = n()
    )

# join onto nyc@data (not the other way around) so the attribute table keeps
# the same row order as the polygon geometries
nyc@data <- left_join(nyc@data, nta_stats, by = "OBJECTID")
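One caveat worth noting: points that don’t fall inside any NTA polygon come back from over() with all attributes set to NA, so a quick count of NA rows flags listings with suspect coordinates. A small check of my own (not part of the original pipeline):

# listings whose coordinates fall outside every NTA polygon
tidy %>%
    filter(is.na(OBJECTID)) %>%
    count()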

Feature Extraction

There are a couple of features that I created as I went along. The first is just the number of characters in the description, a simple metric that doesn’t require a deep dive into NLP. The second is the distance of each listing from the center of all of the listings. I created other features as well, but these two are particularly useful for filtering out some extreme cases (see the sketch after the code below).

nyc_center <- tidy %>%
    summarize(
        lon = median(longitude),
        lat = median(latitude)
    ) %>%
    as.matrix()  # spDists expects matrices rather than data frames

library(magrittr)  # for the %<>% assignment pipe

tidy %<>%
    mutate(
        description_length = as.integer(nchar(description)),
        km_from_centroid = spDists(
            x = cbind(longitude, latitude),  # one row of coordinates per listing
            y = nyc_center,
            longlat = TRUE                   # great-circle distances in km
        ) %>% as.vector()
    ) %>%
    replace_na(list(description_length = 0))  # listings with no description
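For example, these two features make it easy to screen out listings with empty descriptions or coordinates far outside the city. A quick illustrative filter (the cutoffs below are arbitrary choices of mine, not thresholds from the analysis above):

# keep listings with some description text that sit within ~100 km of the
# listing centroid; both thresholds are illustrative only
plausible <- tidy %>%
    filter(description_length > 0,
           km_from_centroid < 100)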

Next Steps

With the data imported and tidied up, the next step was to explore the dataset.