This report is for the final course of the IBM Data Science Specialization hosted on the Coursera platform. The project lets learners be as creative as they want: they use the location data available via the FourSquare API to compare the neighborhoods of a city of their choice and come up with a problem that can be solved using that data.
In our problem statement, a group of athletes is planning to live in Seattle for several weeks. They need to find several flats, preferably close to each other to make group work-outs easier. Additional preferences include a park nearby and a low crime rate in the district, because they plan to be outside very often (jogging in the evenings, etc.). The apartments should also be affordable, but our clients value low crime more highly than low price.
The target audience for this report is:
- potential buyers, who can roughly estimate which neighborhoods are more desirable (and the models used for analysis should be easily adjustable),
- real estate builders and planners, who can decide what kind of neighborhoods are more attractive on the market to maximize the selling price of newly built flats,
- this course’s instructors and learners, who will grade my project,
- anyone who is curious how Python can be applied to easily crawl web pages; parse CSV or JSON files; create powerful visualizations of data as scatter plots, heat maps, density plots using matplotlib, seaborn and map visualizations using Folium; process data using lists, dictionaries, pandas DataFrames.
All the code with data analysis is available on my GitHub page.
Seattle’s neighborhoods were chosen as the observation target for the following reasons:
- there is a lot of statistical data freely available for the USA,
- diversity of neighborhoods: Seattle is a rather large city with very different districts,
- availability of geolocation data to allow for visualizations on a map.
For the data acquisition part, we use this Wikipedia article to find Seattle’s district names and coordinates. For most districts, we couldn’t find any additional information there, such as population size, so we get the population figures from the portal ‘Find My Seattle’. Crime data is available from official sources of the City of Seattle. For the prices of flats, we use a data set provided by Airbnb on the Kaggle portal. To locate parks near the flats, we access the FourSquare API.
The process of collecting and cleaning data:
- we use the Python libraries `requests` and `lxml` to scrape Wikipedia pages, locate the attributes and tags of interest using XPath, follow the URLs of all neighborhoods and retrieve their geographical locations (see Figures 1 and 2 below),
- we enter the population sizes of Seattle’s districts by hand into a CSV file,
- we use the Python library `json` to process crime data, which we then analyze on a monthly rolling basis and normalize by the districts’ population sizes,
- to measure proximity to parks, we utilize FourSquare API, namely the `Search for Venues` request with a corresponding categoryId of `4bf58dd8d48988d163941735` (see Figures 3 and 4),
- Airbnb listings are available in CSV format.
Districts are named differently across the data sets. Therefore, we map some districts in the crime data onto bigger districts from the population data and vice versa: the populations of some districts must be summed up to obtain a bigger district, so that there is a one-to-one correspondence between district names.
We do a similar normalization for the population data from FindMySeattle.
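The name harmonization described above can be sketched roughly as follows. This is a minimal illustration, not the code from the project: the alias table, the district names in it and the population figures are invented for the example.

```python
from collections import defaultdict

# Hypothetical alias table: fine-grained or alternative names on the left,
# the agreed common district name on the right.
DISTRICT_ALIASES = {
    "Chinatown": "Chinatown-International District",
    "International District": "Chinatown-International District",
}


def canonical(name):
    """Return the agreed district name, falling back to the cleaned name itself."""
    cleaned = name.strip()
    return DISTRICT_ALIASES.get(cleaned, cleaned)


def merge_populations(populations):
    """Sum the populations of sub-districts into their canonical district."""
    merged = defaultdict(int)
    for name, count in populations.items():
        merged[canonical(name)] += count
    return dict(merged)


# Made-up figures, for illustration only.
sample = {"Chinatown": 1200, "International District": 2300, "Georgetown": 1300}
merged = merge_populations(sample)
```

The same `canonical` helper can be applied to the district column of the crime data, so that both sources end up keyed by identical names.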
From the crime data, we filter out crime categories that aren’t of interest to our clients and focus on violent crimes that occurred within the past decade only (we also drop the data for the incomplete month of January 2019).
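A filter of this kind might look like the sketch below. The record layout (`category`, `district`, `occurred`), the set of violent categories and the exact start date of the ten-year window are assumptions made for the example; the real data set may use different field names and labels.

```python
import json
from datetime import date

# Hypothetical records; field names and categories are assumptions.
RAW = """[
  {"category": "ASSAULT", "district": "Georgetown", "occurred": "2018-11-03"},
  {"category": "PARKING", "district": "Georgetown", "occurred": "2018-11-04"},
  {"category": "ROBBERY", "district": "Pioneer Square", "occurred": "2005-06-01"},
  {"category": "ROBBERY", "district": "Pioneer Square", "occurred": "2019-01-02"}
]"""

VIOLENT = {"ASSAULT", "ROBBERY", "HOMICIDE"}  # assumed category labels
START = date(2009, 1, 1)  # assumed start of the ten-year window
END = date(2019, 1, 1)    # exclusive bound: drops the incomplete January 2019


def select_crimes(records):
    """Keep violent crimes whose date falls inside [START, END)."""
    kept = []
    for rec in records:
        day = date.fromisoformat(rec["occurred"])
        if rec["category"] in VIOLENT and START <= day < END:
            kept.append(rec)
    return kept


crimes = select_crimes(json.loads(RAW))
```

Using an exclusive upper bound keeps the truncation of the incomplete month in one place instead of a separate clean-up step.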
We then form monthly breakdowns of crimes by district and normalize them by dividing by the population of that district (see Figure 6; the size of each circle is proportional to the number of crimes per capita). It looks like the most dangerous districts are Georgetown, Pioneer Square and Chinatown, but the monthly figures fluctuate considerably.
Due to this high volatility, we can’t simply use the last data point for our analysis: there might be seasonal patterns, etc. Therefore, we consider the crime rate on a rolling basis with a window of two years (see Figure 7 as an example for Chinatown – International District). This smooths out the crime rate and gives us a single figure per district. A breakdown of the rolling crime rate is presented in Figure 8.
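The per-capita normalization and the rolling window can be illustrated with a small standard-library sketch. The monthly counts and the population are invented, and the demo uses a window of 3 only to keep the example short; the analysis described above uses 24 months.

```python
from collections import deque


def per_capita(monthly_counts, population):
    """Divide each monthly crime count by the district population."""
    return [count / population for count in monthly_counts]


def rolling_mean(values, window=24):
    """Trailing mean over `window` points; None until the window is full."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for value in values:
        if len(buf) == window:
            total -= buf[0]  # this value is evicted by the append below
        buf.append(value)
        total += value
        out.append(total / window if len(buf) == window else None)
    return out


# Made-up monthly counts for one district with 1000 inhabitants.
counts = [4, 6, 2, 8, 5]
rates = per_capita(counts, population=1000)
smoothed = rolling_mean(rates, window=3)
latest = smoothed[-1]  # the single figure kept for the district
```

With pandas, `Series.rolling(window).mean()` gives the same kind of trailing average and is likely the more convenient choice when the data already lives in a DataFrame.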
To obtain the coordinates of all parks, we query FourSquare. However, it returns no more than 100 parks per request. To overcome this limitation, we take the district coordinates we obtained from Wikipedia, make one request to FourSquare per district coordinate, and then combine the results (see Figures 3 and 9).
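Combining the per-district results could be sketched as below. Neighboring districts will often return the same park, so the sketch deduplicates by venue id. `fetch_parks` is a hypothetical callable standing in for the actual venue search request, and the ids and coordinates are invented so that the example runs offline.

```python
def collect_parks(district_coords, fetch_parks):
    """Query once per district centre and merge the results, deduplicating by id."""
    parks_by_id = {}
    for _name, (lat, lng) in district_coords.items():
        for venue in fetch_parks(lat, lng):
            parks_by_id.setdefault(venue["id"], venue)
    return list(parks_by_id.values())


# Canned responses replacing the real API call (all values are made up).
CANNED = {
    (47.60, -122.33): [
        {"id": "p1", "lat": 47.601, "lng": -122.331},
        {"id": "p2", "lat": 47.610, "lng": -122.325},
    ],
    (47.62, -122.32): [
        {"id": "p2", "lat": 47.610, "lng": -122.325},
        {"id": "p3", "lat": 47.625, "lng": -122.318},
    ],
}


def fake_fetch(lat, lng):
    """Hypothetical stand-in for one venue search request."""
    return CANNED[(lat, lng)]


districts = {"District A": (47.60, -122.33), "District B": (47.62, -122.32)}
parks = collect_parks(districts, fake_fetch)
```

Passing the fetch function as a parameter keeps the merging logic testable without network access or API credentials.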
Then, for each Airbnb listing, we measure its geographic distance to every park in Seattle using the `distance` Python package and store the minimum distance (see Figure 10).
Our client wants to find districts that contain many flats meeting their criteria:
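As an illustration of the nearest-park computation, the sketch below uses the haversine great-circle formula from the standard library. This is a swapped-in technique, not the `distance` package mentioned above, and the coordinates are invented.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) points in degrees."""
    lat1, lng1 = map(radians, a)
    lat2, lng2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_park_km(listing, parks):
    """Distance from one listing to its closest park."""
    return min(haversine_km(listing, park) for park in parks)


# Made-up coordinates for one listing and two parks.
listing = (47.60, -122.33)
parks = [(47.61, -122.33), (47.70, -122.33)]
closest = nearest_park_km(listing, parks)
```

The brute-force minimum over all parks is quadratic in the worst case, which is usually acceptable for a few thousand listings and a few hundred parks.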
- affordable in price;
- low criminality in that region;
- proximity to parks.
First, let’s get some insights into our data using visualizations.
Figure 11 presents a histogram of the distribution of the rolling crime rate (where each tick represents an observation, and bars indicate how many observations fell into the same bin) and the distribution of rental prices for flats by district. There doesn’t seem to be a clear dependency between crime rate and rental price (Figure 12). Also, the crime rate appears to follow a Poisson distribution, whilst rental prices appear to be normally distributed.
The Pearson correlation between these metrics is quite low at 0.178, and the p-value of 0.38 suggests that this slight positive correlation is not statistically significant.
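For reference, the Pearson coefficient itself can be computed with a few lines of standard-library Python, as in the sketch below. The input values are invented and do not reproduce the figures above; the p-value is not computed here (a statistics library such as SciPy, via `scipy.stats.pearsonr`, returns both the coefficient and the p-value).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up per-district values: rolling crime rate and median rental price.
crime_rate = [0.002, 0.004, 0.003, 0.008, 0.005]
price = [95.0, 120.0, 80.0, 110.0, 130.0]
r = pearson_r(crime_rate, price)
```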
Choosing a method for data analysis
The first method suggested to our client was a content-based recommender, with which we would be able to find districts similar to those preferred by the client. However, our clients have never been to Seattle and were unable to name districts they liked.
Unsupervised learning techniques like k-means clustering were ruled out because they are too sensitive to the scaling of the data sets (remember that our data follows different distributions), because the number of clusters is difficult to predict, and because the order of the data has an impact on the final result.
We have suggested to our client that for each of our data sources, we would create attribute groups. For crime rate: safe, normal, dangerous; for rental price: low, affordable, expensive; for proximity to parks: close, further, far.
Thus, for each apartment listed on Airbnb we compute each of these scores, then sum them up (potentially with some weights) to obtain an overall score, and then filter out those not meeting a desired minimum cut-off score.
This approach is easy for clients to understand, it can be extended to include other metrics, and the weights can be adjusted to prioritize different attributes.
For this case study, we agreed to assign each attribute a score between 0 and 10 according to the quantile of the attribute’s distribution. The criminality score carries 1.5 times the weight of the score for proximity to parks, and the price score has a weight of 1.25. An apartment needs a total score of at least 24 to be of interest; additionally, its price and parks scores must each be at least 5 and its criminality score at least 6 (our clients were concerned that we might otherwise choose apartments that are cheap and close to a park full of drug dealers).
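The scoring rule can be sketched as follows. The weights, minimum scores and total cut-off are taken from the paragraph above, with the parks weight read as 1.0. The mapping from quantile to score is an assumption: the sketch scores a value by the share of listings that are strictly lower, which suits all three attributes because lower is better for each of them.

```python
from bisect import bisect_left

WEIGHTS = {"crime": 1.5, "price": 1.25, "parks": 1.0}
MINIMUMS = {"crime": 6, "price": 5, "parks": 5}
TOTAL_CUTOFF = 24


def quantile_score(value, population):
    """Score 0..10; lower values of the attribute earn higher scores.

    Assumed rule: 10 * (1 - share of observations strictly below `value`).
    """
    ordered = sorted(population)
    share_below = bisect_left(ordered, value) / len(ordered)
    return 10 * (1 - share_below)


def total_score(scores):
    """Weighted sum of the three attribute scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def is_desired(scores):
    """Apply the per-attribute minimums and the total cut-off."""
    meets_minimums = all(scores[name] >= MINIMUMS[name] for name in MINIMUMS)
    return meets_minimums and total_score(scores) >= TOTAL_CUTOFF


# Made-up scores for one apartment.
example = {"crime": 8, "price": 6, "parks": 6}
example_total = total_score(example)
```

With these weights the maximum attainable total is 37.5, so a cut-off of 24 corresponds to roughly two thirds of the best possible score.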
Then, to compare districts with each other, we order them by the ratio of desired apartments to the total number of listings.
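The district ranking might be computed along these lines. The listing records and their `district` and `desired` fields are hypothetical and only serve to show the ordering step.

```python
from collections import Counter


def desired_ratio_by_district(listings):
    """Return (district, ratio) pairs, highest share of desired flats first."""
    totals = Counter(item["district"] for item in listings)
    desired = Counter(item["district"] for item in listings if item["desired"])
    ratios = {name: desired[name] / totals[name] for name in totals}
    return sorted(ratios.items(), key=lambda pair: pair[1], reverse=True)


# Made-up listings, for illustration only.
listings = [
    {"district": "District A", "desired": True},
    {"district": "District A", "desired": False},
    {"district": "District A", "desired": False},
    {"district": "District B", "desired": True},
    {"district": "District B", "desired": True},
    {"district": "District C", "desired": False},
]
ranking = desired_ratio_by_district(listings)
```

A ratio rather than an absolute count keeps districts with many listings from dominating the ranking simply because of their size.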
Scores for the apartments have been computed, and an interactive map with markers has been prepared (see Figures 14 and 15).
Judging by the ratio of desired flats to the total number of listings in a district, the best neighborhoods in which to begin the search are Capitol Hill, Madison Park and Green Lake (see Table below). For instance, in Capitol Hill almost every third apartment is desired, and in Madison Park every fourth!
As can be seen from the scatter plots comparing our scores, there is no clear dependency between the scores. However, there appear to be some regions where the dependencies between values are more pronounced. This suggests that it might be beneficial to approach the search for desired apartments slightly differently: instead of taking the granularity of districts as is, one could form regions based on the density of desired apartments in them.
Here is what it would look like in our case if we were to plot a hex grid over geographic coordinates, where the intensity of the color in each cell corresponds to the number of desired flats that fall into it:
It is hard to find a balance between the different attributes of good housing. We have provided our clients with an interactive tool that helps them meet their desired criteria and makes it easy to understand the trade-offs of each particular offer. The tool is extensible and flexible enough to include other attributes or to adjust the priorities of attributes.
(C) Dr Yury Chebiryak, January 2019