Project Description and Dataset
Suppose you are working as a Data Scientist for a real estate investment firm. The
firm is assessing locations for investing in housing redevelopment in the United
States. For this purpose, the firm has identified several potential locations in Seattle
to purchase existing houses, which would be demolished to make space for the
To estimate the costs involved, the firm needs to know the current market
value of the houses that it needs to purchase. You are working on a project that aims
to build a model to estimate the house prices.
Seattle’s Department of Assessments has been collecting data since 2014 on house
sale prices and the characteristics of each house that was sold. You have been
given access to a copy of the original database “house.db”, which is an SQLite file, as
well as a data dictionary file “house_dict.txt”. You can download the dataset and
detailed dataset description from the BUSS6002 Canvas site.
Hint: To list all tables in the database, you can use the following query:
SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;
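The hint query can also be run from Python with the standard-library sqlite3 module. The following is a minimal sketch, not a required approach: the helper name list_tables and the two toy tables are made up for illustration, and an in-memory database stands in for house.db so the snippet runs on its own.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
    ).fetchall()
    return [name for (name,) in rows]


# Demonstration on an in-memory database with made-up tables (assumption).
# For the real data, connect with: conn = sqlite3.connect("house.db")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, price REAL)")
conn.execute("CREATE TABLE attributes (id INTEGER, sqft REAL)")
tables = list_tables(conn)
print(tables)
conn.close()
```

Note that the query needs straight single quotes around table; curly quotes copied from a formatted document will cause a syntax error.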
To start your analysis, you wish to perform a thorough EDA to help you better
understand the given datasets. The results you obtain in this task will be used to
inform your modelling choice.
a. Check for and deal with missing data (if any) in the given dataset.
b. Look for and remove potential outliers (if any) that could affect
your modelling. Justify your answer.
c. Visualise the relationships between explanatory variables and the target
variable through appropriate plotting. Report your analysis and findings.
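As one possible starting point for parts a and b, the sketch below counts missing values per column and flags outliers with the 1.5 × IQR rule using pandas. The column names price and sqft and the toy values are assumptions made for illustration; they are not taken from house_dict.txt. The IQR rule is only one common convention, so any removal still needs its own justification.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and share of missing values, for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"n_missing": counts, "share": counts / len(df)})
    return summary[summary["n_missing"] > 0]


def iqr_outlier_mask(series, k=1.5):
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy data with hypothetical column names (assumption, not the real dataset).
df = pd.DataFrame({
    "price": [300_000, 320_000, 310_000, 305_000, 5_000_000, 315_000],
    "sqft": [1500, 1600, np.nan, 1550, 1580, 1620],
})

print(missing_summary(df))          # columns that contain missing values
mask = iqr_outlier_mask(df["price"])
print(df[mask])                     # rows flagged as price outliers
```

Flagging rows with a mask rather than dropping them immediately makes it easier to inspect each candidate before deciding whether it is a data error or a genuine but unusual sale.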
Suppose now you want to build a prototype model to predict house sale prices, which
will be demonstrated to a wider team. Therefore, it needs to be easily understood by
non-experts, meaning that you can only use a few variables in your model as a starting
point.
To make informed decisions on your modelling choices, you need to answer
the following questions:
a. Suppose you would like to build a linear regression model to predict house sale
prices. Do you wish to include an intercept term in your model? Carefully explain
b. Do you think multicollinearity could be a potential problem in the given dataset?
Use your understanding of the variables to justify your answer and verify your
hypothesis using appropriate numeric measures. Explain how you decide to
proceed based on your findings.
c. If you wish to use only three variables to predict house sale prices, which three
variables would you choose? Carefully justify your choice and explain your
d. Build a linear regression model using the three variables you have chosen (use
the original, i.e. not engineered, variables for this task). Report and interpret your
e. Perform residual diagnostics to measure the goodness of fit. Report your
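The questions above touch on three mechanics: fitting a linear regression with an intercept, measuring multicollinearity numerically, and inspecting residuals. The sketch below illustrates them with NumPy only. It is an illustration under stated assumptions, not a model answer: the data are synthetic, the variable names sqft_living, sqft_above and bedrooms are hypothetical, and the variance inflation factor (VIF) is just one of several possible numeric measures.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, residuals)."""
    A = np.column_stack([np.ones(len(X)), X])  # leading column of ones = intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, y - A @ beta


def r_squared(y, residuals):
    """Share of the variance in y explained by the fitted model."""
    return 1.0 - np.sum(residuals**2) / np.sum((y - np.mean(y)) ** 2)


def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2 of column j on the rest)."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        _, resid_j = fit_ols(others, X[:, j])
        factors.append(1.0 / (1.0 - r_squared(X[:, j], resid_j)))
    return np.array(factors)


# Synthetic data with hypothetical variable names (assumption).
rng = np.random.default_rng(0)
n = 500
sqft_living = rng.normal(2000, 500, n)
sqft_above = 0.8 * sqft_living + rng.normal(0, 50, n)  # nearly collinear by design
bedrooms = rng.integers(1, 6, n).astype(float)
X = np.column_stack([sqft_living, sqft_above, bedrooms])
y = 50_000 + 150 * sqft_living + 10_000 * bedrooms + rng.normal(0, 20_000, n)

beta, resid = fit_ols(X, y)
print("coefficients (intercept first):", beta)
print("R^2:", r_squared(y, resid))
print("VIF per variable:", vif(X))
print("mean residual:", resid.mean())
```

In this toy setup the two area variables are built to be almost collinear, so their VIFs come out far larger than that of bedrooms. With an intercept in the model the residuals average to zero up to floating-point error, which is one practical argument to weigh when answering part a. A residual-versus-fitted plot and a histogram or Q-Q plot of resid would be natural next steps for part e.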
The model you have built so far provides an approximate estimate of house prices.
However, to accurately estimate the costs of the redevelopment plan you must be
able to estimate house prices as accurately as possible.
Your goal is now to improve your model as much as you can through feature
engineering and feature selection. You may consider all variables and apply
appropriate transformations to the variables as necessary.
a. Your model should have an adjusted R-Squared of at least 75%. If your
model cannot achieve an adjusted R-Squared of 75%, report the best
model you can obtain.
b. Justify your choice of feature engineering strategies using EDA and present
c. Compare your new model with the model you built in Task 2 with respect
to adjusted R-Squared. Explain why you should use adjusted R-Squared here
to compare the two models.
d. Provide residual analysis to justify why your new model is more reasonable.
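For reference when comparing models of different sizes, adjusted R-Squared is commonly defined as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors excluding the intercept. The short sketch below applies that formula; the function name adjusted_r2 and the example numbers are illustrative assumptions, not results from the housing data.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors (intercept excluded) and n rows."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Same raw R^2, different model sizes: the larger model is penalised more.
small = adjusted_r2(0.75, n=200, p=3)
large = adjusted_r2(0.75, n=200, p=30)
print(small, large)
```

Because plain R-Squared never decreases when a variable is added, it would always favour the larger engineered model; the adjustment charges each extra predictor against the fit.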
Suppose you have finished your analysis. You now need to report to your manager
and reflect on what you have experimented with in your project:
a. Reflect on how you have used a data science process model to arrive at
modelling and model evaluation, based on how you answered the
previous three questions. Choose only one process model (CRISP-DM or
Snail Shell) to answer this question. Explain how each part of the questions
aligns with the different phases of the process model.
b. The firm is also considering redevelopment projects in other locations.
Comment on whether the model you have built can be applied in
other locations. Justify your answer.