06-06-2019

Reading time: 2 minutes

Data Science Hands-on: Predicting movies’ worldwide revenue

On May 4th, a day known worldwide as Star Wars Day (“May the Fourth”), approximately 40 Data Science lovers seized the occasion to learn more about the subject by practicing and sharing at yet another Lisbon Kaggle Meetup. The “Data Science Hands-on” Meetup took place at Instituto Superior Técnico (IST Campus) and was, fittingly, dedicated to cinema:

the problem addressed consisted of predicting movies’ revenue before their premiere!

This event was also sponsored by Xpand IT, in collaboration with Hackerschool Lisboa, a group of IST students interested in technology who advocate learning by doing.

The event opened with a presentation by Xpand IT’s own Ricardo Pires, who introduced the company and its units focused on data treatment and exploration. Participants got a sense of how these problems fit into a real-world context. Shortly after, Professor Rui Henriques, who teaches Data Science at IST, explained his perspective on how to approach a Data Science problem and offered some tips for the meetup’s challenge.

The data for this challenge supports learning and gives a sense of a realistic problem, since it is semi-structured and takes a great deal of effort to process.

An estimated 80% of Data Scientists’ daily work revolves around data treatment.
Forbes

After the two presentations, participants started to unravel the mysteries hidden within the data. They observed, for example, a general increase in revenue over the years. They also noticed that American movies had higher revenue than all the rest.

[Figure: movie revenue]

Tackling the challenge

In the first part, participants modelled the problem using the simpler, structured columns:

budget
popularity
runtime
date

By doing so, they obtained their first predictions of the movies’ revenue. In the image below, which shows Spearman’s rank correlation coefficient, we can see that the budget and popularity columns are the most correlated with revenue.

[Figure: Spearman’s rank correlation coefficients]
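As a rough illustration of how such a correlation can be computed, the sketch below ranks each column and then takes the Pearson correlation of the ranks, which is what Spearman’s coefficient amounts to. The values in `movies` are made up for the example and are not the meetup data, and the helper names (`ranks`, `pearson`, `spearman`) are hypothetical. In practice a data-frame library would normally handle this.

```python
from math import sqrt

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up toy values, for illustration only (not the meetup data).
movies = {
    "budget": [10, 50, 200, 5, 120],
    "popularity": [8.4, 3.1, 20.5, 1.2, 9.9],
    "runtime": [95, 120, 110, 140, 100],
    "revenue": [25, 130, 900, 8, 300],
}

for column in ("budget", "popularity", "runtime"):
    print(column, round(spearman(movies[column], movies["revenue"]), 3))
```

Because it works on ranks, the coefficient captures any monotonic relationship, not only a linear one, which suits skewed quantities such as budget and revenue.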

During the second phase, contestants tackled the semi-structured columns by applying the one-hot encoding technique to columns such as:

director
cast
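One-hot encoding of a list-valued column can be sketched as follows. This is a minimal illustration, assuming each movie is a dictionary whose `director` and `cast` entries are lists of names. The function name `one_hot`, the record layout, and the sample titles and names are all hypothetical, not taken from the meetup data.

```python
def one_hot(rows, column):
    """Expand a list-valued column into 0/1 indicator columns,
    one per distinct value seen across all rows."""
    categories = sorted({value for row in rows for value in row[column]})
    encoded = []
    for row in rows:
        present = set(row[column])
        encoded.append({f"{column}_{c}": int(c in present) for c in categories})
    return encoded

# Hypothetical records, for illustration only.
movies = [
    {"title": "Movie A", "director": ["Director X"], "cast": ["Actor 1", "Actor 2"]},
    {"title": "Movie B", "director": ["Director Y"], "cast": ["Actor 2", "Actor 3"]},
]

cast_features = one_hot(movies, "cast")
for movie, features in zip(movies, cast_features):
    print(movie["title"], features)
```

Each distinct name becomes its own column, so columns with many distinct values (a large cast list, for instance) can produce a very wide table. Keeping only the most frequent names is one common way to limit that.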

Through this deeper analysis of the data, teams identified the movies that generated the most revenue (see table below).

[Table: movies by budget and revenue]

Another relevant aspect to consider is that popularity is not always directly related to revenue. Such is the case with “Transformers: Dark of the Moon”, which appears as less popular but has a high revenue nonetheless.

It is also interesting to observe the actors who generated the most revenue on average:

Conclusions

At the end of the meetup, participants shared their implemented solutions:

The group with the best results applied Logistic Regression. Despite being a simple model, it can provide adequate results when the focus is data treatment.
Data treatment involved several techniques, such as detecting outliers among movies with a highly discrepant budget and replacing those values with the median.
Budget and revenue columns were log-transformed to bring them closer to a Gaussian distribution.
One advantage of using a simpler model is that it is also easier to explain to a business stakeholder.
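The two data-treatment steps mentioned above can be sketched roughly as follows. The post does not say how outliers were detected, so the interquartile-range rule used here is an assumption, as is the use of log(1 + x) rather than a plain logarithm (chosen so that zero values stay defined). The function names and the sample budgets are hypothetical.

```python
import math
from statistics import median, quantiles

def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median.
    The IQR rule is an assumption; the post does not name the detection rule."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    mid = median(values)
    return [v if low <= v <= high else mid for v in values]

def log_transform(values):
    """Apply log(1 + x), which stays defined when a value is zero."""
    return [math.log1p(v) for v in values]

# Made-up budgets with one extreme value, for illustration only.
budgets = [10, 12, 15, 18, 20, 22, 25, 5000]

cleaned = replace_outliers_with_median(budgets)
log_budgets = log_transform(cleaned)

print(cleaned)
print([round(v, 3) for v in log_budgets])
```

When the target column is transformed in the same way, predictions come out on the log scale and need `math.expm1` applied to return to the original units.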

The fourth of May was spent learning alongside the most wonderful people, and it was enlightening in every way. If you’re interested in Data Science, join the community and show up at our monthly events.

More information on the “Data Science Hands-on” Meetup.
