11-05-2021

Reading time: 8 minutes

Machine Learning for noobies: meet practical examples

By João Varela and António Capela, Data Scientists @ Xpand IT

Machine Learning (ML) is a field that has received a lot of attention in recent years. According to the latest Artificial Intelligence Hype Cycle released by Gartner, Machine Learning is no longer an area with overly high expectations; it has entered a more mature phase, with the plateau of real productivity expected in the very near future. So if you think ML might be the job for you, now is the right time to become an engineer in this area. In this article we explain what Machine Learning is, list some of its applications, and share some advantages and disadvantages of working in this field. By the end, we hope you will have a clearer idea of whether working as an ML engineer is your dream job.

What is Machine Learning?

According to IBM, Machine Learning is “a form of artificial intelligence that allows a system to learn from data rather than from explicit programming”. Fortunately for all of us, artificial intelligence is not yet a super-intelligence that will take over the world and wipe out the human race. It is a set of advanced mathematical techniques and algorithms that identify patterns and trends in large amounts of data, in order to automate processes or extract insights that help people make decisions. This is why even the most advanced voice assistants, like Alexa, are not sentient. They are just very intelligent systems that can recognise patterns of sounds in their users’ voices. Even Mr. Stark, when he developed his assistant, named it J.A.R.V.I.S. (“just a rather very intelligent system”).


What is a Machine Learning model?

We can start by thinking of a Machine Learning model as a child learning to speak. The child receives stimuli from its parents in the form of example words that it must try to imitate and, little by little, after many attempts, the child begins to repeat these sounds. In this analogy the parents’ stimuli are our dataset: a set of examples that the child must try to interpret and later imitate. The dataset has a great influence on our model: the words the parents repeat most often will be learned most easily, the child will speak the same language as the parents, and even the parents’ pronunciation will shape the child’s speech. This is why, in ML, our dataset must always be sufficiently representative of what we are trying to model.

The child’s learning process is known in the ML world as the training phase. In this phase, small adjustments are made to the model iteratively in order to bring its result closer to the desired result. The difference between the model’s result and the desired result is called the error, and the small adjustments made during the training phase are calculated through mathematical operations based on this same error. During the learning process, each child has his or her own strengths and difficulties, so different learning methods suit different children. Analogously, in ML we have different types of models: some are more appropriate for a certain type of data, others for large amounts of data, and so on.
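To make the training phase more concrete, here is a minimal sketch in Python. It assumes the simplest possible model, a straight line `y = w * x + b`, and uses gradient descent on the mean squared error as the rule for the small adjustments. The data, the function name `train` and the values of `learning_rate` and `steps` are made up for illustration; real projects would normally rely on an ML library instead.

```python
# Toy training loop: fit y = w * x + b by repeatedly nudging w and b
# in the direction that reduces the mean squared error.

def train(xs, ys, learning_rate=0.01, steps=5000):
    w, b = 0.0, 0.0  # the model starts out knowing nothing
    n = len(xs)
    for _ in range(steps):
        # Error for each example: model result minus desired result.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Small adjustment, computed from the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up examples that follow y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = train(xs, ys)
print(f"learned w={w:.3f}, b={b:.3f}")
```

After enough iterations the learned `w` and `b` end up very close to the values 2 and 1 that generated the examples, which is the "little by little" improvement described in the analogy.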

This way of learning is not only valid for speech; the same idea applies to learning to walk or, later, to learning at school. Similarly, ML models have many applications: they can receive the weather forecast for the next day and predict the number of ice creams that will be sold that day; they can receive an image and try to identify which object is present in it; or they can receive a recording of a user’s voice and identify whether the user wants to turn on the light in the room or prepare their usual coffee, among many others.

As mentioned earlier, Machine Learning comprises several types of algorithms that learn in different ways from a set of data. Taking the first example above, these systems look at data for which the outcome is already known, in this case the weather reports and the ice cream sales over the past year, and iteratively learn to recognise patterns in this data (fewer ice creams are sold on rainy days, more are sold when it is hot, etc.). After this training step, the algorithm can estimate, with a certain level of uncertainty, how many ice creams will be sold on future days. This type of analysis helps the ice cream shop owner decide how many ice creams to keep in stock for the next day, how many workers will be needed, and so on, which leads to more informed shop management and possibly cost optimisation.
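As an illustration of the ice cream example, the sketch below fits a straight line to a handful of past days and uses it to forecast sales for a hotter day. Everything here is an assumption made for the example: the numbers in `history` are invented, temperature is the only input, and ordinary least squares stands in for whatever algorithm a real project would choose.

```python
from statistics import mean

# Hypothetical past records: (temperature in °C, ice creams sold that day).
history = [(18, 20), (22, 33), (25, 40), (28, 51), (31, 58), (34, 68)]


def fit_line(points):
    """Fit sold = slope * temperature + intercept by ordinary least squares."""
    x_mean = mean([x for x, _ in points])
    y_mean = mean([y for _, y in points])
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in points)
    variance = sum((x - x_mean) ** 2 for x, _ in points)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return slope, intercept


def typical_error(points, slope, intercept):
    """Root mean squared difference between predictions and known sales."""
    residuals = [y - (slope * x + intercept) for x, y in points]
    return (sum(r * r for r in residuals) / len(residuals)) ** 0.5


slope, intercept = fit_line(history)
forecast = slope * 30 + intercept  # prediction for a 30 °C day
error = typical_error(history, slope, intercept)
print(f"forecast for 30 °C: {forecast:.0f} ice creams (typical error ±{error:.1f})")
```

The typical error is a rough way of expressing the "certain level of uncertainty" mentioned above: the shop owner gets a number to plan around, together with an idea of how far off it might be.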

How are Machine Learning systems different to more traditional software systems?

A more traditional software system implements a well-defined strategy or algorithm. That is, all the conditions of the problem must be spelled out explicitly in its implementation. ML does not in any way replace these traditional systems. If we want to develop an application for selling ice cream for home delivery, we want the behaviour of the application to always be identical: the user fills in their address, chooses their preferred ice cream, makes the payment, and receives their fresh ice cream at home.

This kind of problem can be solved completely with a traditional software system. If, on the other hand, the ice cream shop owner wanted to predict how many ice creams will be sold the next day using traditional techniques, every situation would have to be spelled out by hand (if it is hotter than 30 °C with less than 20% humidity, 7 ice creams will be sold; if the temperature drops to 25 °C, it will be 6; and so on). In a traditional system the predictions depend on a set of rules created by a developer, and it is in this type of problem that ML models can help.
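Written as code, the traditional approach might look like the sketch below. The function name is hypothetical, the two rules come from the example above, and reading "drops to 25 °C" as "25 °C or more" is an assumption. Any situation the developer did not anticipate simply has no answer.

```python
def predict_sales_with_rules(temperature_c, humidity_pct):
    """Hand-written rules: every situation must be listed by a developer."""
    if temperature_c > 30 and humidity_pct < 20:
        return 7
    if temperature_c >= 25:  # assumed reading of "drops to 25 °C"
        return 6
    # No rule was written for this situation yet.
    return None


print(predict_sales_with_rules(32, 15))  # hot and dry day
print(predict_sales_with_rules(25, 15))  # slightly cooler day
print(predict_sales_with_rules(10, 80))  # cold, humid day: not covered
```

Covering more weather conditions, weekdays and holidays means adding and maintaining more and more of these branches by hand, which quickly becomes impractical.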


In a Machine Learning system, this set of rules is adjusted automatically by the model itself, based on a dataset of past examples. In the case of the ice cream shop, the model could receive daily records as input parameters, such as temperature, humidity, number of sales in recent days, day of the week, whether it is a holiday, etc., and predict the number of ice creams sold that day, together with measures of expected accuracy that help validate the chosen approach. This paradigm is mainly relevant in more complex problems involving large amounts of data, with millions of records and thousands of variables, where it is much harder to define these conditions by hand.
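The sketch below shows one very simple way a model can derive its predictions from past examples with several input parameters, and how its accuracy can be estimated on days it has not seen. It uses a k-nearest-neighbours average, chosen here only because it fits in a few lines; the records, feature order and function names are all invented for illustration.

```python
from math import dist  # Euclidean distance, available since Python 3.8

# Hypothetical daily records:
# (temperature °C, humidity %, is_weekend, is_holiday) -> ice creams sold
train_days = [
    ((32, 30, 1, 0), 70),
    ((30, 35, 0, 0), 60),
    ((28, 40, 0, 0), 52),
    ((25, 50, 1, 1), 48),
    ((22, 60, 0, 0), 35),
    ((20, 65, 0, 0), 30),
    ((18, 80, 0, 0), 20),
    ((15, 85, 1, 0), 18),
]

# Days kept aside to check how well the model does on unseen data.
held_out_days = [
    ((31, 32, 0, 0), 64),
    ((19, 75, 0, 0), 24),
]


def predict(examples, features, k=3):
    """Average the sales of the k past days most similar to `features`."""
    nearest = sorted(examples, key=lambda day: dist(day[0], features))[:k]
    return sum(sold for _, sold in nearest) / k


def mean_absolute_error(examples, test_days, k=3):
    """Average gap between predicted and actual sales on held-out days."""
    gaps = [abs(predict(examples, f, k) - sold) for f, sold in test_days]
    return sum(gaps) / len(gaps)


mae = mean_absolute_error(train_days, held_out_days)
print(f"prediction for a hot weekday: {predict(train_days, (31, 32, 0, 0)):.1f}")
print(f"mean absolute error on held-out days: {mae:.1f} ice creams")
```

Nobody wrote a rule for "31 °C on a weekday": the prediction comes from similar days in the dataset. In a real project the inputs would usually be rescaled first, so that a variable with large values such as humidity does not dominate the distance.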

What has contributed to the popularity of Machine Learning

Despite the recent hype around this technology, the concept of Machine Learning dates back to the 1950s, when the first machine learning algorithms were developed. As mentioned previously, these models need previously labelled datasets for their training process; furthermore, iteratively processing large amounts of data requires very high computational power, which was not available at the time. According to IBM, six factors gave rise to this advance in artificial intelligence:

Modern processors are increasingly powerful;
The cost of storing large amounts of data keeps falling, notably with the emergence of cloud platforms;
The emergence of distributed computing technologies;
The increase in data production, which benefits the training of this type of model;
Implementations of these models are increasingly shared openly, which contributes greatly to their use and to research;
Data visualisation techniques are becoming increasingly accessible.

Machine Learning Applications

Recent advances in ML have allowed this technology to be applied in many different business areas. You can find some examples in the last article we published at Xpand IT about what Data Science is. Some more recent, much-discussed examples of these models in action are AutoPilot, Deep Fake and Open AI.

AutoPilot

Tesla is developing ML systems that allow vehicles to drive without human intervention. Given a specified destination and the different types of sensors installed in the vehicle, such as cameras and radars, it is already possible to take a trip without ever touching the steering wheel.


Deep Fake

Deep Fake is an ML technique that allows the creation of fake videos. In one example, excerpts from Barack Obama’s speeches were used to create a fake video of him, altering his facial expressions and speech. Although this technology can be very dangerous, it can also be used in the entertainment business or to create interactive platforms in museums.

Open AI

Open AI has developed an intelligent system capable of playing Dota 2. After playing the equivalent of 10,000 years of the game against itself, the model was able to defeat the best team in the world. Dota 2 is a game that requires a lot of coordination, and this victory was considered a great milestone. Although it is only a game, it provides a controlled environment in which these types of models can be tested before being applied in real-world areas such as robotics, autonomous driving or medicine.


Advantages and disadvantages of being a Machine Learning engineer

To help you understand the challenges we encounter daily in ML development, here are the points we consider most relevant when weighing the advantages and disadvantages of this type of project.

Advantages:

The range of ML applications is very wide, which means you can not only choose the projects you find most captivating but also work and learn in many different business areas.
As an ML engineer you have the opportunity to have a direct influence on people’s lives. Although it sounds a bit cliché, it’s true that the model you’re developing will impact someone’s life, whether because you recommended a new product to them or because you gave them 5 extra minutes of free time by showing them the fastest way home.
You have the possibility to work with many different tools, such as different programming languages, graphic libraries, distributed computing systems, and much more.

Disadvantages:

In machine learning there are no 100% correct answers. There is a wide variety of approaches to arrive at a result, and that result is itself difficult to validate. The many approaches and uncertainties can make this area a little overwhelming and difficult to master.
Because the area is constantly developing, the techniques you learn can quickly become outdated. Constant learning of new models and tools is necessary.

Conclusion

Machine learning is an area undergoing rapid development. More and more companies need to use these technologies to stay competitive in the market, which creates great demand for skilled professionals in the area and makes this the right time to join the ML hype train. We hope this article has helped you form an opinion about what it is like to be an ML engineer and, who knows, we might meet on some project in the future. If you want to know a little more about our work, you can visit Xpand IT’s Data Science page.
