BUSS6002 Assignment 2
Due Date: Wednesday 6 November 2019
Value: 25% of the total mark
This group assignment has been designed to allow students to contextualise their data
science skills on a real-world problem in business domains, as well as to help students
develop collaborative skills when working in a team.
1. Required submission items via Canvas:
1. ONE written report (PDF format).
• Assignments > Report Submission (Assignment 2)
2. ONE Jupyter Notebook .ipynb
• Assignments > Upload Your Code File (Assignment 2)
3. ONE csv file of test results
• Assignments > Test Results Submission (Assignment 2)
2. The assignment is due at 12:00pm (noon) on Wednesday, 6 November 2019. The
late penalty for the assignment is 5% of the assigned mark per day, starting after
12:00pm on the due date. The closing date, Wednesday, 13 November 2019,
12:00pm (noon), is the last date on which an assessment will be accepted for
marking.
3. As per the anonymous marking policy, please include the Group ID and Student IDs of
all group members. Do NOT include names. The name of the report and code file
must follow: GroupID_BUSS6002_Assignment2_S12019, and the name of test
results must follow: GroupID_BUSS6002_Assignment2_Test_Results.csv.
4. Your analyses and answers should be provided as a final report that gives full
explanation and interpretation of any results you obtain. Output without
explanation will receive zero marks. You are required to also submit your code that
can reproduce your reported results, as reproducibility is a key component to data
science. Not submitting your code will lead to a loss of 50% of the assignment marks.
5. Be warned that plagiarism between individuals is always obvious to the markers of
the assignment and can be easily detected by Turnitin.
6. Presentation of the assignment is part of the assessment: 10% of the marks are
allocated to the presentation of your final report and/or code.
7. Numbers with decimals should be reported to three decimal places.
Meeting Minutes and Peer Review
1. Each group is required to submit at least 3 sets of meeting minutes as an appendix
to the final report. A template will be provided for preparing meeting
minutes. You may use the template provided or a template of your choice.
2019S2 BUSS6002 2
2. We may ask for peer review from each student within a group. The instructions
about how to do this will be released later.
3. Each group will be awarded a group mark as per the marking criteria. In special
cases, an individual adjustment to the group mark may be made if there is a dispute
in a group or the quality/quantity of contributions made by individuals is
significantly different. In such cases the unit coordinator will seek meeting
minutes and peer review reports from individuals within a group to decide on
4. If you encounter any issues with your group members, please report and discuss
them with your unit coordinator as early as possible.
A competition will be run among groups to rank the performance of your models on the
test data provided. The top 5 groups will be awarded bonus marks to top up their
overall assignment mark: the top 3 groups will receive an extra 5 marks, and the 4th and 5th
groups will receive an extra 3 marks.
Project Description and Dataset
In recent years, we have witnessed an explosive growth of user review data generated
across social media (e.g., reviews, forum discussions, blogs, Twitter) on the Web.
Individuals and companies are increasingly using such data to better understand their
audience and make better decisions. Individual consumers can check the opinions of
existing users of a product to help them make wiser purchase decisions. Through analyzing
public and consumer opinions towards their products or services, companies can develop
comprehensive insights into customers’ experience, and use these to improve their offering,
build a better brand and improve their business. At the same time, trends and
developments of opinions in online social media provide a strong signal about the
prospects, health and value of a company, as well as its brands to stakeholders and
Suppose you are now working on a Data Science Team in a private equity firm. You are
tasked to build a sentiment analysis system that automatically predicts customers’
sentiment polarity (positive or negative) towards a range of businesses. Your analyses and
findings will assist the firm in selecting prospective businesses to invest in.
The datasets provided to your team are collected from a leading online review portal,
which contains detailed information on a collection of reviews and businesses. The
datasets are organized in three datafiles: review_train.csv, review_test.csv, and
business.csv. Only review_train.csv contains the target variable: polarity,
where 1 indicates “positive” and 0 indicates “negative”. business.csv contains
detailed information about businesses associated with review data. This information may
be helpful to build a more accurate prediction model.
The details of the features presented in the above datasets are given in description.txt.
Note that it may not be feasible to directly use some of these features (in particular reviews
represented as raw text) to build a classification model, so one of your tasks is to carefully
extract or construct meaningful features as input for the modelling process.
Please note that in this assignment you will be solving a real-world problem under realistic
settings and most tasks are deliberately designed to be open ended. This gives you freedom
to explore and optimize your solution.
Exploratory Data Analysis (EDA): Conduct initial analysis and a thorough EDA for the
given datasets. This includes, but is not limited to: checking/dealing with missing data,
visualising the distributions of features, identifying features that can better distinguish
different target values, correlation analysis, etc. Carefully present your analysis and
findings in your report.
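The EDA steps listed above could start from something like the following minimal sketch. It uses a tiny made-up DataFrame in place of review_train.csv; only the polarity column is named in this brief, so the text and useful_votes columns are hypothetical placeholders, and the real column names should be taken from description.txt.

```python
import pandas as pd

# Tiny made-up stand-in for review_train.csv. Only `polarity` is a column name
# given in the brief; `text` and `useful_votes` are hypothetical placeholders.
reviews = pd.DataFrame({
    "text": ["great food", "terrible service", None, "loved it", "never again"],
    "useful_votes": [3.0, 1.0, 4.0, None, 2.0],
    "polarity": [1, 0, 1, 1, 0],
})

# 1) Count missing values per column.
missing_counts = reviews.isna().sum()

# 2) Check the class balance of the target (as proportions).
class_balance = reviews["polarity"].value_counts(normalize=True).round(3)

# 3) Derive a simple feature and compare it across the two classes.
reviews["text_length"] = reviews["text"].fillna("").str.len()
length_by_class = reviews.groupby("polarity")["text_length"].mean().round(3)

# 4) Correlation of numeric features with the target (NaNs are skipped pairwise).
correlations = reviews[["useful_votes", "text_length", "polarity"]].corr()["polarity"]

print(missing_counts, class_balance, length_by_class, correlations, sep="\n\n")
```

On the real data, the same calls would be followed by plots of the feature distributions and a written discussion of what they show.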
Benchmark Model: Build a simple logistic regression model to assess the feasibility of
the project and establish a baseline model. For this task, you are required to build your
baseline model using tf-idf vectors of review text only. Use scikit-learn’s logistic
regression model with “solver” set to ‘liblinear’ and all other parameters set to default. Use
scikit-learn’s TfidfVectorizer with “max_features” set to 500 and all other parameters set
to default. You need to use appropriate model evaluation strategies to validate your model.
Present your results and discuss your findings.
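One possible way to set up the benchmark described above is sketched here. The TfidfVectorizer and LogisticRegression settings follow the brief; the toy reviews, the 3-fold stratified cross-validation, and the choice of F1 as the validation metric are illustrative assumptions, not requirements.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up placeholder reviews; the real text would come from review_train.csv.
texts = [
    "great food and friendly staff", "loved the great service",
    "excellent coffee, great atmosphere", "friendly staff and excellent value",
    "great place, will come back", "excellent meal and lovely staff",
    "terrible food and rude staff", "awful service, never again",
    "rude waiter and terrible coffee", "awful meal, terrible value",
    "terrible place, very slow", "slow service and awful food",
]
labels = [1] * 6 + [0] * 6  # 1 = positive, 0 = negative

# Settings from the brief: max_features=500 and solver="liblinear",
# with every other parameter left at its default.
benchmark = make_pipeline(
    TfidfVectorizer(max_features=500),
    LogisticRegression(solver="liblinear"),
)

# Assumed validation strategy: stratified 3-fold cross-validation scored by F1.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
f1_scores = cross_val_score(benchmark, texts, labels, cv=cv, scoring="f1")
print(f"Mean F1 across folds: {f1_scores.mean():.3f}")
```

Wrapping the vectorizer and the classifier in one pipeline means the tf-idf vocabulary is refitted on each training fold only, so the validation folds do not leak into the features.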
Improving Your Model: You are required to make attempts to improve the performance
of your benchmark model as much as you can. You might consider using more advanced
feature engineering techniques and incorporating extra sources of information such as
business.csv to rebuild your model. Justify any choice you make and provide
detailed explanation. You must properly validate your model and optimise appropriate
hyperparameters that apply. You should demonstrate evidence of your efforts and you will
be assessed based on the depth of your exploration. Report your settings and comparisons
with the benchmark model.
Note: For this task, if you want to use a classification model that is not taught in this unit,
you must clearly explain the principle of the model, justify why you choose that model, and
present your analysis. For any model you choose, you need to optimise your model in terms
of its parameters as well. Simply building a model without any consideration of validation
and optimisation does not meet the minimum requirements.
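As a rough illustration of validating and optimising hyperparameters together, the sketch below runs a small grid search over a tf-idf plus logistic regression pipeline. The toy reviews, the parameter grid, and the F1 scoring are assumptions chosen for illustration; a real submission would search a grid suited to its own features and model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Made-up placeholder reviews; the real text would come from review_train.csv.
texts = [
    "great food and friendly staff", "loved the great service",
    "excellent coffee, great atmosphere", "friendly staff and excellent value",
    "great place, will come back", "excellent meal and lovely staff",
    "terrible food and rude staff", "awful service, never again",
    "rude waiter and terrible coffee", "awful meal, terrible value",
    "terrible place, very slow", "slow service and awful food",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Hypothetical grid: n-gram range, vocabulary size, and regularisation strength.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_features": [500, 2000],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

The table in search.cv_results_ records the score of every setting tried, which can serve as evidence of the exploration and as the basis for a comparison with the benchmark.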
Interpreting Results: Decide on your best model and provide analysis and interpretation
of its behavior. For your interpretation you should focus on identifying general rules that
might be useful for the firm in the future, such as how to quickly identify popular or
unpopular businesses. For example, you may report the top 20 most important features that
most contribute to the classification of positive/negative sentiment or provide commentary
on characteristics of businesses that may receive positive/negative reviews.
Final Test Results: Finally, apply your best model to the test data and report the
classification results. Save your results in a csv file containing
two columns, one for the Review Index (review_id from review_test.csv) and the
other, polarity, for the predicted labels (1s or 0s). An example file of test
results, test_results_example.csv, will be provided. Name your file
GroupID_BUSS6002_Assignment2_Test_Results.csv. The results on the test data will
be assessed to decide your group’s performance among the entire class (group competition!).
Note that we will use the F1-score as the test score for the group competition.
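The output step might look like the following sketch. The ids, predictions, and validation labels are made up, "GroupID" is a placeholder for the real group ID, and the column header review_id and the absence of an index column are assumptions that should be checked against test_results_example.csv.

```python
import pandas as pd
from sklearn.metrics import f1_score

# Hypothetical ids and predictions; in practice the ids come from
# review_test.csv and the predictions from the chosen best model.
test_ids = ["r1", "r2", "r3", "r4"]
predicted = [1, 0, 1, 1]

# Two columns as described in the brief (header names are an assumption).
results = pd.DataFrame({"review_id": test_ids, "polarity": predicted})

group_id = "GroupID"  # placeholder: replace with the actual group ID
output_path = f"{group_id}_BUSS6002_Assignment2_Test_Results.csv"
results.to_csv(output_path, index=False)  # assumed: no extra index column

# The test labels are not provided, so F1 can only be estimated on held-out
# training data. Made-up validation labels and predictions for illustration:
y_true = [1, 0, 0, 1]
y_pred = [1, 0, 1, 1]
validation_f1 = round(f1_score(y_true, y_pred), 3)
print(validation_f1)  # 0.8 (precision 2/3, recall 1)
```

Checking that the saved file has exactly one row per review in review_test.csv is a cheap safeguard before submission.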
• The assignment material to be submitted will consist of a final report that:
1) Takes a research article form in which you shall have a number of sections
such as introduction, methodology, experiment results,
findings/interpretation, and conclusion. All references should be properly
cited and take a full bibliographical format. Here are a few examples
2) Details ALL steps and decisions taken by the group regarding requirements
3) Demonstrates an understanding of the problem being addressed and the
relevant principles of data science techniques used.
4) Clearly and appropriately presents any relevant graphs and tables.
• The report should NOT be more than 20 pages with font size no smaller than
11pt, including all text, figures, tables, and small sections of inserted code, but
excluding the cover page and the appendix containing the meeting minutes. Think
about the best and most structured way to present your work, summarise the
procedures implemented, support your results/findings and prove the originality of
your work. You will provide your code as a separate submission to the report.
• Your code submission has no length limit; however, make sure your code is as
concise as possible and add comments where necessary to explain the functionality
of your code segments.
• Your group is required to submit at least 3 sets of meeting minutes. Your group may use
the provided template for preparing meeting minutes. Documentation should
include attendance, discussion points, actions decided, etc. You may use your own
form or find something online.
• You, as a member of a group, may also be required to submit your peer review.
Please use the provided criteria sheet for this purpose. You will be advised how to
use an online form when it becomes available.
2019S2 BUSS6002 Assignment 2 (Group Project) – Marking Guidelines
Part Criteria Max Marks
A Exploratory Data Analysis (EDA):
• Properly checking/dealing with missing values;
• Coherent and thorough EDA with sufficient
• Excellent presentation of analysis results.
B Benchmark Model:
• Successfully build a logistic regression model using
tf-idf vectors of review text with specified parameters.
• A demonstrated validation process with your analysis
to evaluate model performance.
• Excellent documentation of your findings and
C Improving your Model:
• Clear explanation and justification for the use of extra
features or data sources (e.g., business.csv).
• Careful justification of the choice of feature
• A demonstrated validation process with your new
model and careful optimisation of hyperparameters.
• Full report of your new model setting, analysis, and
comparison with your benchmark model.
D Interpreting Results:
• Provide proper interpretation for your best model.
E Final Test Results (Group competition):
• Successful application of your best model to test data
provided (e.g., correct number of predictions)
• Correct name and format for the final test result csv
• Well-structured report with clear presentation of text,
figures, tables, and formula (if applicable), free of
spelling and grammar errors, etc.
• Well-documented code with necessary comments.
F Bonus Mark:
• Only awarded to the top five groups in the group
competition (5 marks for the top 3 groups, 3 marks for the 4th and 5th groups).