AI and Machine Learning Applications for Social Media Platforms

By Alfred J Nigl, Ph.D. and Dean Grey

Although the future of AI and Machine Learning in social media remains uncertain because of global data privacy legislation, social media businesses are still applying both to understand, segment and organize the vast amounts of data collected daily.

Narayan (2018) detailed six areas where AI (for purposes of simplicity, AI will be used in this paper to represent the combined application of AI with machine learning) is impacting social media marketing.

  1. Content Creation: To keep up with the heavy demand for content that fuels an intense rate of marketing campaigns (the average brand launches six campaigns a month), brand marketing professionals are using AI to discover user-generated content that can be incorporated into a marketing campaign. Twizoo is an example of a company that uses AI to automatically generate marketing campaign content, including user-generated videos.

  2. Consumer Intelligence: AI methods can uncover hidden gems in user behavior, especially leveraging the value in unstructured data and posting and comment trends and converting these consumer behaviors into actionable marketing tools to drive campaigns. Conversion is a digital consulting company that specializes in gathering market intelligence from social and customer sentiment (“voice of customer data”).

  3. Customer Service: Chatbots are AI tools that more and more companies are using to replace human operators in live chat situations. Skylab USA recently upgraded its community platform chat system by partnering with Applozic which has the capability of integrating with chatbots in the future.

  4. Influencer marketing: In his recent article, Narayan is quoted as saying, “It’s imperative that brands have more intelligence into how they associate with certain influencers….” The use of AI methods will make the matches between brands and influencers stronger than matches made by humans alone. InsightPool is a new platform that searches through 600 million influencers who have been identified across various social media platforms to find the influencers that best fit a brand’s unique characteristics, personality and goals. Skylab is also researching brand affinities among its platform user base; however, its focus is on micro-influencers, not high-volume influencers. Skylab has recently completed a series of surveys which investigated the extent to which one planet (i.e., a separate Skylab business unit) had a sufficient number of micro-influencers among its user base and which brands they felt an affinity toward.

  5. Content Optimization: Many companies have leveraged the market intelligence that AI methods can produce to uncover topics and brand affinities, based on customer blogs, that are of very high interest among large numbers of consumers. By compiling this information, social media platforms can create high-value content that will attract millions of users. Because Skylab is driven more by user behavior than by content, this is not a high-priority focus for Skylab research at this time.

  6. Competitive Intelligence: There are at least two areas where AI methods can help companies identify competitor threats and advances in order to maintain their edge and/or market share; these include using NLP to determine the meaning behind clusters of words and statistical AI to disambiguate data to create actionable insights and also to identify outliers. An important part of this process is pattern matching to differentiate important consumer social media posts and content from the trivial.

Machine Learning to help manage the 3 V’s of Big Data:

The three Vs of Big Data: Volume, Velocity, and Variety have overwhelmed traditional methods of data management and analysis. The rapid growth of Big Data has been associated with the adoption of AI and Machine Learning methods for analyzing the tremendous volume and complexity of data.

In fact, based on a recent blog post (snap.stanford.edu, 2018), every second of every day 3.3 million new posts appear on Facebook and about half a million on Twitter. To help keep track of such a high volume of data, analysts have developed highly sophisticated analytical and modeling methods using open source tools such as R and Python. The algorithms created enable computers to identify patterns and classify them into clusters. Such algorithms are well suited to providing usable information derived from unstructured data and a mix of video, other media, text messages, posts and poster preferences, with a higher degree of accuracy than is possible for a human analyst or marketing specialist.

Scraping and Data Lakes

In order to intelligently process the overwhelming volume and variety of social media activity, web-scraping tools have been created to gather all the posts associated with a particular brand, store them in a data lake, from which they can be fed into algorithms to be segmented and clustered into relevant bits of information that can then be used to develop marketing messages and drive new marketing campaigns. A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse (i.e., the traditional Oracle-type relational DB) stores data in files or folders, a data lake uses a flat architecture to store data.

Data Lake

Skylab USA’s Approach to Analyzing Big Data – Tableau and Redshift

In order to more efficiently process, scale and analyze the tremendous amount of data generated by its many unique business units (in the Skylab vernacular, called planets), Skylab has moved its data to the cloud (AWS) and is using APIs to connect key data points to Redshift. Skylab’s main analysis and visualization tool is Tableau, which was chosen because of its robust analytics engine.

The summary document below shows the variety and complexity of the Skylab platform user data which is currently being processed with the Redshift-Tableau integration.


  1. User Engagement

  2. DAU and MAU totals

  3. DAU % and MAU % compared to the total active user base, which is defined as users who have been active on the platform in the last 180 days (this time frame needs to be adjustable)

  4. DAU over MAU % is a measure of “stickiness”

  5. Rank to social media

  6. Rank to Skylab’s clients

  7. Brand Engagement % (this really is a stat we compare to others)

  8. Rank to other platforms

  9. Rank in Skylab’s world

  10. Sustained Engagement / Retention %

  11. Based off Wilson’s Law ( the 30/10/10 rule and hopefully we can find more )

  12. VRS Index ( you becoming a better you ) with weights for calculating an Index Score for each Planet

  13. Actions 20%

  14. community focused -33% of the 20%

  15. Personal Growth -33% of the 20%

  16. Helping others – 33% of the 20%

  17. Social Engagement 20%

  18. Post 20%

  19. Lesson 20%

  20. Chat 20%

  21. Retention measurement, using the standard formula: the total users who launched the app for the first time, and the total users who went back on the app at the following intervals after the day they first downloaded it: the next day (Day 1), 3 days after, and 7 days after

  22. Track the average amount of time users spend on each of the planet apps on a daily and monthly basis

  23. Reporting and exporting capabilities

  24. All Data must be able to be presented in a dashboard and exported

  25. Business uses

  26. in a report for the CMO

  27. For Sponsors

  28. For Data Science

  29. for Widgets on our apps

  30. Leveraging a 3rd-party solution for quick implementation

  31. https://www.tableau.com/embedded-analytics
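The retention measurement described in the list above (users returning on Day 1, Day 3 and Day 7 after first launch) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `retention_rates`, the input shapes and the sample users are hypothetical, and the actual Skylab pipeline runs through Redshift and Tableau rather than code like this.

```python
from datetime import date, timedelta

def retention_rates(first_launch, sessions, intervals=(1, 3, 7)):
    """Return {N: share of users active exactly N days after their first launch}.

    first_launch: dict mapping user_id -> date of first app launch (hypothetical shape)
    sessions: list of (user_id, date) pairs, one per day a user opened the app
    """
    active_days = {}
    for user_id, day in sessions:
        active_days.setdefault(user_id, set()).add(day)

    total = len(first_launch)
    rates = {}
    for n in intervals:
        returned = sum(
            1
            for user_id, start in first_launch.items()
            if start + timedelta(days=n) in active_days.get(user_id, set())
        )
        rates[n] = returned / total if total else 0.0
    return rates

# Hypothetical data: three users who first launched the app on the same day.
start = date(2018, 6, 1)
first_launch = {"u1": start, "u2": start, "u3": start}
sessions = [
    ("u1", start), ("u2", start), ("u3", start),
    ("u1", start + timedelta(days=1)),
    ("u2", start + timedelta(days=1)),
    ("u1", start + timedelta(days=3)),
    ("u1", start + timedelta(days=7)),
]
rates = retention_rates(first_launch, sessions)
for n, rate in rates.items():
    print(f"Day {n} retention: {rate:.0%}")
```

Here two of three users return on Day 1 and one of three on Days 3 and 7. Whether "retained" means active on exactly day N or on any day up to N is a definition choice the team would need to fix before comparing planets.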

Community Stats module to provide data to be displayed/reported in different ways



  1. On an analytics dashboard, for Admins to see on the web

  2. On widgets, on the home screen of mobile and web apps for End Users to see

  3. On emails being sent out to End Users and Admins, at each level of the hierarchy (i.e. Experience Admin report, Planet Admin report, etc.)

  4. As csv exportable files

  5. Reports for sponsors

  6. Feeding external analytics systems like Google Analytics or Firebase Analytics

  7. Providing an API for integrations, allowing other systems to gather data from our stats module

  8. The module must allow historic reporting

  9. e.g. how many total users at a time period. today/ last week/ last month (to track growth)

  10. «Community Stats» Features general logic

  11. Widgets must work to track/gamify the Experiences as well

  12. On any Planet, Planet Admins must be able to view the following Widgets displaying Experiences instead of Users:

  13. Recently Active Widget

  14. Community Recognition Widget (a.k.a Leaderboard)

  15. On Skylab Planet only

  16. Community Impact Widget must have the ability to display aggregated stats for the Universe

  17. Planet Admin must be able to set the following Widgets to display Planets instead of Users

  18. Recently Active Widget

  19. Community Recognition Widget (a.k.a. Leaderboard)

  20. Stats Contests

  21. «Backbone Data warehouse» Feature Requirements

  22. This is the foundational infrastructure and architecture that must be in place since the beginning, to enable the full business vision and features roadmap to be materialized over time

  23. Time Estimate for completion ( 2-3 days )

  24. Complete Data Points definitions

  25. Data aggregation rules / consolidation and sorting logic (Experience -> Planet -> Universe)

  26. Data transformation from RDS via KAFKA into Redshift

  27. Test Redshift/Kafka productivity and performance

  28. Verify connection, Kafka and redshift configurations on Security and Encryption

  29. Setup Monitoring Environment and integrate it with existing New Relic solution

  30. Data Reporting Mechanism

  31. Dashboard (Tableau.com)

  32. Automated Statistical Reporting (Monthly, Weekly, On-demand)

  33. Data export

  34. Data archiving and deletion mechanism

  35. Backbone V2

  36. Integration of https://segment.com/mobile with mobile (iOS, Android)

  37. Redshift Integration with Segment.io

KPI/Metrics Definition & Logic

Daily Active Users (DAU)

  1. For the sake of consistency, DAU should be measured using the same criteria we use for measuring MAU (i.e., opening the app, or opening the app and taking at least one action; we need to decide)

  2. Note: Facebook’s definition of DAU is calculated by aggregating the number of users who open up the app in a 24 hour period regardless of what else they do, including nothing.

  3. Minimum

  4. Anytime front-end sends the request to get info from the backend

  5. e.g. anytime scrolling up/down through the Home Screen

  6. Mechanism to set user inactive/passive/asleep/suspend

  7. Boris Nayflish add the details in here

  8. Mechanism to re-engage without forcing a user to log back in

  9. DAU should be measured on a 24-hour basis from the 1st day of each new month to the last day for each individual planet and for Skylab as a whole; the average DAUs for a month can then be calculated and that result can be divided by the total users to get a DAU%

Monthly Active Users (MAU)

  1. Any User who has been a DAU at least once in the last 30 days

DAU/MAU ratio

  1. What is the DAU/MAU Ratio?

  2. The Daily Active Users (DAU) to Monthly Active Users (MAU) Ratio measures the stickiness of your product – that is, how often people engage with your product. DAU is the number of unique users who engage with your product in a one day window. MAU is the number of unique users who engage with your product over a 30-day window (usually a rolling 30 days).

  3. The ratio of DAU to MAU is the proportion of monthly active users who engage with your product in a single day window.

  4. Advice from VCs: Why DAU/MAU Ratio is critical

  5. “If there’s one number every founder should always know, it’s the company’s growth rate. That’s the measure of a startup. If you don’t know that number, you don’t even know if you’re doing well or badly. The best thing to measure the growth rate of is revenue. The next best, for startups that aren’t charging initially, is active users. That’s a reasonable proxy for revenue growth because whenever the startup does start trying to make money, their revenues will probably be a constant multiple of active users.” – Paul Graham, VC, and Co-Founder of Y Combinator

  6. “The metrics we start with are total active users (monthly/weekly/daily) and their growth, alongside any ratios like DAU/MAU or DAU/WAU. These help us understand how frequently active people are in using the products.” – Josh Elman, Partner at Greylock Partners

  7. “I would argue that the single most telling metric for a great product is how many of them become dedicated, repeat users.” – Andrew Chen, Angel Investor

  8. How to calculate DAU/MAU Ratio:

  9. (#) Daily active users / (#) Monthly active users = (%) DAU/MAU Ratio

  10. The key to calculating DAU/MAU Ratio is defining what ‘active’ is for your product. This could be anything from a purchase (for e-commerce or mobile apps), pages viewed/videos watched/comments (for media/publisher), or product login/usage (for SaaS companies or mobile apps).

  11. Once you’ve defined ‘active’ for your product, determine the number of unique active users in a 24-hr period and also the number of unique active users over the past 30 days (usually based on a rolling 30 days). With these two metrics, you can divide DAU by MAU to get the ratio percentage.

  12. A variation of this metric is to swap MAU with the total number of unique weekly active users (WAU). This gives you the DAU/WAU Ratio.
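The calculation described above can be sketched as follows. This is a minimal example, assuming activity is available as a list of (user, date) records; the function name `dau_mau_ratio`, the record layout and the sample users are hypothetical, and what counts as an "active" record is exactly the definition question raised above.

```python
from datetime import date, timedelta

def dau_mau_ratio(events, day, window_days=30):
    """Return (DAU, MAU, DAU/MAU) for `day`.

    events: list of (user_id, date) pairs, one per qualifying "active" action.
    The MAU window is the rolling `window_days` days ending on `day`.
    """
    window_start = day - timedelta(days=window_days - 1)
    daily_users = {user for user, d in events if d == day}
    monthly_users = {user for user, d in events if window_start <= d <= day}
    ratio = len(daily_users) / len(monthly_users) if monthly_users else 0.0
    return len(daily_users), len(monthly_users), ratio

today = date(2018, 6, 30)
events = [
    ("a", today),
    ("a", today),                        # repeat activity counts once
    ("b", today),
    ("c", today - timedelta(days=10)),
    ("d", today - timedelta(days=29)),   # first day of the rolling window
    ("e", today - timedelta(days=30)),   # outside the window
]
dau, mau, ratio = dau_mau_ratio(events, today)
print(f"DAU={dau} MAU={mau} DAU/MAU={ratio:.0%}")
```

Passing `window_days=7` gives the DAU/WAU variation. Users are counted as unique sets, so a user who is active many times in a day still contributes one DAU.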

Super Consistent Users (SCU), Super Engaged (SEU) and Super Loyal (SLU)

  1. CONSISTENCY – SCU is a measure of User’s consistency (i.e. current streaks)

  2. 3 days

  3. 7 days

  4. 30 days

  5. 90 days

  6. 6 months

  7. 1yr

  8. 2 Yrs

  9. Filters by the % of the total group per category.

  10. Filter by the total number of users to qualify for each category

  11. ENGAGEMENT – SEU is a measure of User’s engagement score points earned per time frame

  12. FREQUENCY – SFU is a measure of User’s number of sessions per day

  13. 2x

  14. 3x

  15. 4x

  16. 5x


  17. Filter by % and total users in the community

  18. ATTENTION – SAU (Super Attentive User) is a measure of User’s time per session

  19. RETENTION/LOYALTY – SLU is a measure of Users with high retention rates (i.e., coming back to the app over time)

  20. Measured by the following time periods:

  21. 3 days

  22. 7 days

  23. 30 days

  24. 60

  25. 90

  26. 6 months

  27. 1yr

  28. 2yr

  29. The period of time to calculate SEU can be variable (e.g. of the week, of the month)

Post-Tableau Predictive Analytics Process

After the data is processed through Tableau and basic visualization reports have been created, Skylab’s data science team uses the intelligence created by Tableau to build predictive models using such tools as KNIME, R, and Python.

The types of algorithms that can be applied to Skylab user data include the following:

  1. Decision tree

  2. Random forest

  3. Logistic regression

  4. Support vector machine

  5. Naive Bayes

Definitions and all illustrations below are based on a blog post on bigdata-madesimple.com (http://bigdata-madesimple.com/10-machine-learning-algorithms-know-2018/). There are two main categories of predictive models: regression models and classification models.

According to Dr. Jason Brownlee, Predictive modeling is the problem of developing a model using historical data to make a prediction on new data where we do not have the answer.

Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output variables (y). This is called the problem of function approximation. The job of the modeling algorithm is to find the best mapping function we can given the time and resources available.

Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).

The output variables are often called labels or categories. The mapping function predicts the class or category for a given observation.

For example, an email or text can be classified as belonging to one of two classes: “spam“ and “not spam“.

  1. A classification problem requires that examples be classified into one of two or more classes.

  2. A classification can have real-valued or discrete input variables.

  3. A problem with two classes is often called a two-class or binary classification problem.

  4. A problem with more than two classes is often called a multi-class classification problem.

  5. A problem where an example is assigned multiple classes is called a multi-label classification problem.

It is common for classification models to predict a continuous value as the probability of a given example belonging to each output class. The probabilities can be interpreted as the likelihood or confidence of a given example belonging to each class. A predicted probability can be converted into a class value by selecting the class label that has the highest probability.

For example, a specific email or text may be assigned the probabilities of 0.1 as being “spam” and 0.9 as being “not spam”. We can convert these probabilities to a class label by selecting the “not spam” label as it has the highest predicted likelihood.
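As a small illustration of this conversion, the sketch below picks the label with the highest probability. The helper name `predict_label` is hypothetical; the probabilities are the ones from the example above.

```python
def predict_label(probabilities):
    """Return the class label with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)

email_probabilities = {"spam": 0.1, "not spam": 0.9}
label = predict_label(email_probabilities)
print(label)
```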

There are many ways to estimate the skill of a classification predictive model, but perhaps the most common is to calculate the classification accuracy.

The classification accuracy is the percentage of correctly classified examples out of all predictions made.

For example, if a classification predictive model made 5 predictions and 3 of them were correct and 2 of them were incorrect, then the classification accuracy of the model based on just these predictions would be:

accuracy = correct predictions / total predictions * 100 = 3 / 5 * 100 = 60%

Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).

A continuous output variable is a real-value, such as an integer or floating point value. These are often quantities, such as amounts and sizes.

For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000.

  1. A regression problem requires the prediction of a quantity.

  2. A regression can have real valued or discrete input variables.

  3. A problem with multiple input variables is often called a multivariate regression problem.

  4. A regression problem where input variables are ordered by time is called a time series forecasting problem.

Because a regression predictive model predicts a quantity, the skill of the model must be reported as an error in those predictions.

There are many ways to estimate the skill of a regression predictive model, but perhaps the most common is to calculate the root mean squared error, abbreviated by the acronym RMSE.

For example, if a regression predictive model made 2 predictions, one of 1.5 where the expected value is 1.0 and another of 3.3 where the expected value is 3.0, then the RMSE would be:

RMSE = sqrt(((1.0 - 1.5)^2 + (3.0 - 3.3)^2) / 2) = sqrt((0.25 + 0.09) / 2) = sqrt(0.17) ≈ 0.412
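Both evaluation measures can be checked with a few lines of Python. This is only a sketch: the function names and the label values are hypothetical, while the counts (3 correct out of 5) and the regression values (1.5 vs. 1.0 and 3.3 vs. 3.0) come from the two examples above.

```python
import math

def classification_accuracy(expected, predicted):
    """Percentage of predictions that match the expected class labels."""
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return correct / len(expected) * 100

def rmse(expected, predicted):
    """Root mean squared error between expected and predicted quantities."""
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Five hypothetical labels, three of which are predicted correctly.
expected_labels = ["spam", "spam", "not spam", "not spam", "spam"]
predicted_labels = ["spam", "spam", "not spam", "spam", "not spam"]
accuracy = classification_accuracy(expected_labels, predicted_labels)

# The two regression predictions from the text.
error = rmse([1.0, 3.0], [1.5, 3.3])

print(f"accuracy = {accuracy:.0f}%")
print(f"RMSE = {error:.3f}")
```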

Classification predictive modeling problems are different from regression predictive modeling problems.

  1. Classification is the task of predicting a discrete class label.

  2. Regression is the task of predicting a continuous quantity.

However as Brownlee points out, there is some overlap between the algorithms for classification and regression; for example:

  1. A classification algorithm may predict a continuous value, but the continuous value is in the form of a probability for a class label.

  2. A regression algorithm may predict a discrete value, but the discrete value is in the form of an integer quantity.

Some algorithms can be used for both classification and regression with small modifications, such as decision trees and artificial neural networks. Some algorithms cannot, or cannot easily be used for both problem types, such as linear regression for regression predictive modeling and logistic regression for classification predictive modeling.

Brownlee (2018) also points out that the way that data scientists evaluate classification and regression predictions varies and does not overlap, for example:


  1. Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.

  2. Regression predictions can be evaluated using root mean squared error, whereas classification predictions cannot.

Types of Machine Learning Algorithms

In a post on February 6, 2017, Rahul Saxena presented an explanation of Bayesian machine learning.

Naive Bayes classifier is a straightforward and powerful algorithm for the classification task. Even if we are working on a data set with millions of records with some attributes, it is suggested to try a Naive Bayes approach.

Naive Bayes classifier gives great results when we use it for textual data analysis, such as Natural Language Processing (NLP).

To understand the naive Bayes classifier we first need to understand the Bayes theorem.

Bayes theorem is named after Rev. Thomas Bayes. It works on conditional probability. Conditional probability is the probability that something will happen, given that something else has already occurred. Using conditional probability, we can calculate the probability of an event using prior knowledge.

Below is the formula for calculating the conditional probability.

P(H|E) = (P(E|H) * P(H)) / P(E)

  1. P(H) is the probability of hypothesis H being true. This is known as the prior probability.

  2. P(E) is the probability of the evidence(regardless of the hypothesis).

  3. P(E|H) is the probability of the evidence given that hypothesis is true.

  4. P(H|E) is the probability of the hypothesis given that the evidence is there.

Let’s consider an example from Skylab USA to understand how the above formula of Bayes theorem works.

Problem:

Imagine that there are two types of users identified on a particular planet (the term used to designate unique business units on Skylab USA): those who positively influence others to purchase a brand “D” and those who do not. A test is developed to measure various influencer traits, with two possible results, “Positive” and “Negative.” The test is determined to identify Positive Influencers with 99% accuracy: if you have the positive Brand D traits, you will test positive 99% of the time, and if you don’t have these traits, you will test negative 99% of the time. If only 3% of all the people have these traits and the test gives a particular user a “positive” result, what is the probability that that user actually is a positive Brand D influencer?

To solve the above problem, we will have to use conditional probability.

The probability that a person positively influences others to buy Brand D: P(D) = 0.03 = 3%

The probability that the test gives a “positive” result given that the person is a positive influencer: P(Pos | D) = 0.99 = 99%

The probability that a person is not a positive influencer of D: P(~D) = 0.97 = 97%

The probability that the test gives a “positive” result given that the person is not a positive influencer: P(Pos | ~D) = 0.01 = 1%

To calculate the probability that the person is a positive influencer, i.e., P(D | Pos), we will use Bayes theorem:

P(D | Pos) = (P(Pos | D) * P(D)) / P(Pos)

We have all the values of the numerator, but we need to calculate P(Pos):

P(Pos) = P(D, Pos) + P(~D, Pos) = P(Pos | D) * P(D) + P(Pos | ~D) * P(~D) = 0.99 * 0.03 + 0.01 * 0.97 = 0.0297 + 0.0097 = 0.0394

Let’s calculate: P(D | Pos) = (P(Pos | D) * P(D)) / P(Pos) = (0.99 * 0.03) / 0.0394 = 0.753807107

So, there is an approximately 75% chance that the person is actually a positive Brand D influencer.
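The same calculation can be written as a short Python function, which makes it easy to see how the answer changes with the prior. This is a sketch only: the function and parameter names are hypothetical, and the inputs are the values from the problem above.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(D | Pos) via Bayes theorem.

    prior:               P(D), share of users who are positive influencers
    true_positive_rate:  P(Pos | D)
    false_positive_rate: P(Pos | ~D)
    """
    # Total probability of a positive test result, P(Pos).
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence

p = posterior(prior=0.03, true_positive_rate=0.99, false_positive_rate=0.01)
print(f"P(D | Pos) = {p:.4f}")
```

Even with a 99% accurate test, the low 3% base rate keeps the posterior near 75%. Raising the prior pushes the posterior toward 1.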

Naive Bayes Classifier

Naive Bayes is a kind of classifier which uses the Bayes Theorem. It predicts membership probabilities for each class such as the probability that given record or data point belongs to a particular class. The class with the highest probability is considered the most likely class. This is also known as Maximum A Posteriori (MAP).

The MAP for a hypothesis is:

MAP(H) = max( P(H|E) ) = max( (P(E|H)*P(H))/P(E)) = max(P(E|H)*P(H))

P(E) is the evidence probability, and it is used to normalize the result. It is the same for every hypothesis, so removing it does not affect which hypothesis has the maximum value.

Naive Bayes classifier assumes that all the features are unrelated to each other. Presence or absence of a feature does not influence the presence or absence of any other feature. We can use the Wikipedia example to explain the logic:

A fruit may be considered to be an apple if it is red, round, and about 4″ in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.

In real datasets, we test a hypothesis given multiple pieces of evidence (features), so calculations become complicated. To simplify the work, the feature independence approach is used to ‘uncouple’ the pieces of evidence and treat each one as independent.

P(H|Multiple Evidences) = P(E1| H)* P(E2|H) ……*P(En|H) * P(H) / P(Multiple Evidences)
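To make the independence formula concrete, here is a minimal from-scratch sketch of a naive Bayes classifier for categorical features. Everything in it is an assumption made for illustration: the function names, the two yes/no features and the six training rows are hypothetical toy data, not Skylab data. It also applies no smoothing, so a feature value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def predict(row, class_counts, feature_counts):
    """Return the MAP class and the unnormalized scores P(E1|H) * ... * P(En|H) * P(H)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(H)
        for i, value in enumerate(row):
            score *= feature_counts[(label, i)][value] / count  # P(Ei|H)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: (posts daily?, shares brand content?) -> user type.
rows = [
    ("yes", "yes"), ("yes", "yes"), ("yes", "no"),
    ("no", "no"), ("no", "no"), ("no", "yes"),
]
labels = ["influencer"] * 3 + ["regular"] * 3

class_counts, feature_counts = train(rows, labels)
label, scores = predict(("yes", "yes"), class_counts, feature_counts)
print(label, scores)
```

The denominator P(Multiple Evidences) is left out because it is the same for every class and does not change which class scores highest.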

Example of Naive Bayes Classifier

The best way to understand a theoretical concept is to try it on an example. Since I am a pet lover, I selected animals as the predicted class.

Types of Naive Bayes Algorithm

Gaussian Naive Bayes

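When attribute values are continuous, an assumption is made that the values associated with each class are distributed according to a Gaussian (i.e., normal) distribution.

Suppose an attribute “x” in our data contains continuous values. We first segment the data by class and then compute the mean and variance of x in each class k, where n_k is the number of records in class k and the sums run over those records:

(mean) μ_k = (1 / n_k) * Σ x_i

(variance) σ_k² = (1 / (n_k - 1)) * Σ (x_i - μ_k)²

The probability density of an observed value v given class k is then:

P(x = v | k) = (1 / sqrt(2π * σ_k²)) * exp(-(v - μ_k)² / (2 * σ_k²))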
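The per-class mean, variance and Gaussian likelihood can be sketched in plain Python as follows. The attribute (minutes per session), the two classes and their values are hypothetical, and the use of the sample variance with an n - 1 denominator is an assumption; some libraries divide by n instead.

```python
import math

def mean_and_variance(values):
    """Mean and sample variance (n - 1 denominator) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance

def gaussian_likelihood(v, mean, variance):
    """Normal probability density of v for the given mean and variance."""
    return math.exp(-((v - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Hypothetical continuous attribute: minutes per session, segmented by class.
minutes_by_class = {
    "engaged": [10.0, 12.0, 14.0],
    "casual": [2.0, 3.0, 4.0],
}
stats = {label: mean_and_variance(vals) for label, vals in minutes_by_class.items()}
likelihoods = {label: gaussian_likelihood(11.0, *stats[label]) for label in stats}

for label, value in likelihoods.items():
    print(f"P(x = 11 | {label}) = {value:.3g}")
```

In a full classifier these likelihoods would be multiplied by the class priors, exactly as in the MAP formula, before picking the most probable class.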

MultiNomial Naive Bayes

MultiNomial Naive Bayes is preferred for data that is multinomially distributed. It is one of the standard classic algorithms used in text categorization (classification). Each event in text classification represents the occurrence of a word in a document.

Bernoulli Naive Bayes

Bernoulli Naive Bayes is used on data that is distributed according to multivariate Bernoulli distributions, i.e., multiple features can be present, but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. So, it requires features to be binary valued.

Advantages and Disadvantages of Naive Bayes classifier

Advantages

  1. Naive Bayes Algorithm is a fast, highly scalable algorithm.

  2. Naive Bayes can be used for Binary and Multiclass classification. It provides different types of Naive Bayes Algorithms like GaussianNB, MultinomialNB, BernoulliNB.

  3. It is a simple algorithm that depends on doing a bunch of counts.

  4. Great choice for Text Classification problems. It’s a popular choice for spam email classification.

  5. It can be easily trained on a small dataset

Disadvantages

  1. Naive Bayes can learn the importance of individual features but can’t determine the relationships among features.

Decision Tree Algorithm

  1. A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

  2. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but are also a popular tool in machine learning.

A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.

In decision analysis, a decision tree and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated.

A decision tree consists of three types of nodes:[1]

  1. Decision nodes – typically represented by squares

  2. Chance nodes – typically represented by circles

  3. End nodes – typically represented by triangles

Decision trees are commonly used in operations research and operations management. If, in practice, decisions have to be taken online with no recall under incomplete knowledge, a decision tree should be paralleled by a probability model as the best choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.

Decision trees, influence diagrams, utility functions, and other decision analysis tools and methods are taught to undergraduate students in schools of business, health economics, and public health, and are examples of operations research or management science methods.

Decision tree building block

MultiNomial Naive Bayes

MultiNomial Naive Bayes is preferred to use on data that is multinomially distributed. It is one of the standard classic algorithms. Which is used in text categorization (classification). Each event in text classification represents the occurrence of a word in a document.

Bernoulli Naive Bayes

Bernoulli Naive Bayes is used on the data that is distributed according to multivariate Bernoulli distributions.i.e., multiple features can be there, but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. So, it requires features to be binary valued.

Advantages and Disadvantage of Naive Bayes classifier

Advantages

  1. Naive Bayes Algorithm is a fast, highly scalable algorithm.

  2. Naive Bayes can be used for Binary and Multiclass classification. It provides different types of Naive Bayes Algorithms like GaussianNB, MultinomialNB, BernoulliNB.

  3. It is a simple algorithm that depends on doing a bunch of counts.

  4. Great choice for Text Classification problems. It’s a popular choice for spam email classification.

  5. It can be easily trained on a small dataset

Disadvantages

  1. Naive Bayes can learn individual features importance but can’t determine the relationship among features.

Decision Tree Algorithm

  1. A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

  2. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but are also a popular tool in machine learning.

A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.

In decision analysis, a decision tree and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated.

A decision tree consists of three types of nodes:[1]

  1. Decision nodes – typically represented by squares

  2. Chance nodes – typically represented by circles

  3. End nodes – typically represented by triangles

Decision trees are commonly used in operations research and operations management. If in practice, decisions have to be taken online with no recall under incomplete knowledge, a decision tree should be paralleled by a probability model as the best choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.

Decision trees, influence diagrams, utility functions, and other decision analysis tools and methods are taught to undergraduate students in schools of business, health economics, and public health, and are examples of operations research or management science methods.

[Figure: Decision tree building block]

[Figure: Decision tree elements]

Decision trees used in data mining are of two main types:

  1. Classification tree analysis is when the predicted outcome is the class to which the data belongs.

  2. Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient’s length of stay in a hospital).

The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures, first introduced by Breiman et al.[3] Trees used for regression and trees used for classification have some similarities – but also some differences, such as the procedure used to determine where to split.[3]
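
The difference in split procedure can be sketched as follows: the same threshold search is run with two different impurity measures, Gini impurity for class labels and variance for real-valued targets. These two measures are common choices in CART-style implementations, but the helper names and the toy data here are assumptions for illustration only.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def variance(values):
    """Variance of a list of numeric targets, used as regression impurity."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_threshold(xs, ys, impurity):
    """Return the split point on one numeric feature that minimizes weighted impurity."""
    pairs = sorted(zip(xs, ys), key=lambda pair: pair[0])
    best_score, best_split = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(pairs)
        if score < best_score:
            best_score, best_split = score, threshold
    return best_split

xs = [1, 2, 3, 10, 11, 12]
classes = ["low", "low", "low", "high", "high", "high"]  # classification target
prices = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]               # regression target

print(best_threshold(xs, classes, gini))     # 6.5
print(best_threshold(xs, prices, variance))  # 6.5
```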

Some techniques, often called ensemble methods, construct more than one decision tree:

  1. Boosted trees incrementally build an ensemble by training each new model to emphasize the training instances that were previously mis-modeled. A typical example is AdaBoost. These can be used for regression-type and classification-type problems.[5][6]

  2. Bootstrap aggregated (or bagged) decision trees, an early ensemble method, build multiple decision trees by repeatedly resampling the training data with replacement and letting the trees vote for a consensus prediction.[7]

  3. A random forest classifier is a specific type of bootstrap aggregating.

  4. Rotation forest – in which every decision tree is trained by first applying principal component analysis (PCA) on a random subset of the input features.[8]
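
Bootstrap aggregating, the second item above, can be sketched with very small trees. In this example each "tree" is a one-split stump on a single numeric feature; the stump learner, the function names, the seed, and the toy data are all assumptions made to keep the sketch short, and a real implementation would grow full trees.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(sample):
    """Fit a one-split tree: return (threshold, left_label, right_label)."""
    xs = sorted({x for x, _ in sample})
    candidates = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for t in candidates:
        left = [y for x, y in sample if x <= t]
        right = [y for x, y in sample if x > t]
        right_label = majority(right)
        left_label = majority(left) if left else right_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    return best[1:]

def bagged_predict(stumps, x):
    """Let every stump vote and return the consensus label."""
    votes = [left if x <= t else right for t, left, right in stumps]
    return majority(votes)

data = [(1, "a"), (2, "a"), (3, "a"), (4, "a"), (6, "b"), (7, "b"), (8, "b"), (9, "b")]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Each stump is trained on a bootstrap sample drawn with replacement.
stumps = [fit_stump([rng.choice(data) for _ in data]) for _ in range(25)]

print(bagged_predict(stumps, 1.5), bagged_predict(stumps, 8.5))
```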

A special case of a decision tree is a decision list,[9] which is a one-sided decision tree so that every internal node has exactly 1 leaf node and exactly 1 internal node as a child (except for the bottommost node, whose only child is a single leaf node). While less expressive, decision lists are arguably easier to understand than general decision trees due to their added sparsity, permit non-greedy learning methods[10] and monotonic constraints to be imposed.[11]
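
A decision list can be sketched as an ordered sequence of tests where the first test that matches decides the label and anything left over falls through to a default. The follower thresholds and label names below are hypothetical values invented for the example, not figures from any platform.

```python
def classify_with_list(rule_list, default, sample):
    """Return the label of the first rule whose test passes, else the default."""
    for test, label in rule_list:
        if test(sample):
            return label
    return default

# Ordered rules: each one is checked only if all earlier rules failed.
influencer_rules = [
    (lambda s: s["followers"] > 100_000, "macro-influencer"),
    (lambda s: s["followers"] > 1_000, "micro-influencer"),
]

print(classify_with_list(influencer_rules, "regular user", {"followers": 250_000}))
print(classify_with_list(influencer_rules, "regular user", {"followers": 5_000}))
print(classify_with_list(influencer_rules, "regular user", {"followers": 40}))
```

Because every internal node has one leaf and one further test, the whole model reads top to bottom as a chain of if/else-if statements.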

Decision tree learning is the construction of a decision tree from class-labeled training tuples. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node.

There are many specific decision-tree algorithms. Notable ones include:

  1. ID3 (Iterative Dichotomiser 3)

  2. C4.5 (successor of ID3)

  3. CART (Classification And Regression Tree)

  4. CHAID (Chi-squared Automatic Interaction Detector): performs multi-level splits when computing classification trees. Separately, Decision Stream uses test statistics to generate a deep directed acyclic graph of decision rules for classification and regression tasks.[4]

  5. MARS: extends decision trees to handle numerical data better.

  6. Conditional Inference Trees: a statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting. This approach results in unbiased predictor selection and does not require pruning.[13][14]

ID3 and CART were invented independently at around the same time (between 1970 and 1980), yet they follow a similar approach for learning a decision tree from training tuples.
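
As an illustration of how such algorithms choose a test from training tuples, ID3 is usually described as picking, at each node, the attribute with the highest information gain (the reduction in label entropy after splitting). The sketch below computes that quantity for a toy dataset; the attribute names and rows are invented, and it shows only the attribute-selection step, not a full tree builder.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    """Attribute that an ID3-style learner would test at this node."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Class-labeled training tuples, invented for this example.
rows = [
    {"has_link": "yes", "length": "short", "label": "spam"},
    {"has_link": "yes", "length": "long", "label": "spam"},
    {"has_link": "no", "length": "short", "label": "ham"},
    {"has_link": "no", "length": "long", "label": "ham"},
]

print(best_attribute(rows, ["has_link", "length"], "label"))  # has_link
```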

References

Rokach, Lior; Maimon, O. (2008). Data mining with decision trees: theory and applications. World Scientific Pub Co Inc. ISBN 978-9812771711.

Quinlan, J. R. (1986). “Induction of Decision Trees”. Machine Learning. 1: 81–106. Kluwer Academic Publishers.

Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software. ISBN 978-0-412-04841-8.

Ignatov, D. Yu.; Ignatov, A. D. (2017). “Decision Stream: Cultivating Deep Decision Trees”. IEEE ICTAI: 905–912. arXiv:1704.07657. doi:10.1109/ICTAI.2017.00140.

Friedman, J. H. (1999). Stochastic gradient boosting. Stanford University.

Hastie, T., Tibshirani, R., Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer Verlag.

Breiman, L. (1996). “Bagging Predictors”. Machine Learning. 24: 123–140.

Rodriguez, J. J.; Kuncheva, L. I.; Alonso, C. J. (2006). “Rotation forest: A new classifier ensemble method”. IEEE Transactions on Pattern Analysis and Machine Intelligence. 28 (10): 1619–1630. doi:10.1109/TPAMI.2006.211.

Rivest, Ron (Nov 1987). “Learning Decision Lists” (PDF). Machine Learning. 3 (2): 229–246. doi:10.1023/A:1022607331053.

Letham, Ben; Rudin, Cynthia; McCormick, Tyler; Madigan, David (2015). “Interpretable Classifiers Using Rules And Bayesian Analysis: Building A Better Stroke Prediction Model”. Annals of Applied Statistics. 9: 1350–1371. arXiv:1511.01644. doi:10.1214/15-AOAS848.

Wang, Fulton; Rudin, Cynthia (2015). “Falling Rule Lists” (PDF). Journal of Machine Learning Research. 38.

Kass, G. V. (1980). “An exploratory technique for investigating large quantities of categorical data”. Applied Statistics. 29 (2): 119–127. doi:10.2307/2986296. JSTOR 2986296.

Hothorn, T.; Hornik, K.; Zeileis, A. (2006). “Unbiased Recursive Partitioning: A Conditional Inference Framework”. Journal of Computational and Graphical Statistics. 15 (3): 651–674. doi:10.1198/106186006X133933. JSTOR 27594202.

Strobl, C.; Malley, J.; Tutz, G. (2009). “An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests”. Psychological Methods. 14 (4): 323–348. doi:10.1037/a0016973.

Rokach, L.; Maimon, O. (2005). “Top-down induction of decision trees classifiers-a survey”. IEEE Transactions on Systems, Man, and Cybernetics, Part C. 35 (4): 476–487. doi:10.1109/TSMCC.2004.843247.

Witten, Ian; Frank, Eibe; Hall, Mark (2011). Data Mining. Burlington, MA: Morgan Kaufmann. pp. 102–103. ISBN 978-0-12-374856-0.

James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert (2015). An Introduction to Statistical Learning. New York: Springer. p. 315. ISBN 978-1-4614-7137-0.

Mehta, Dinesh; Raghavan, Vijay (2002). “Decision tree approximations of Boolean functions”. Theoretical Computer Science. 270 (1–2): 609–623. doi:10.1016/S0304-3975(01)00011-1.

Hyafil, Laurent; Rivest, RL (1976). “Constructing Optimal Binary Decision Trees is NP-complete”. Information Processing Letters. 5 (1): 15–17. doi:10.1016/0020-0190(76)90095-8.

Murthy, S. (1998). “Automatic construction of decision trees from data: A multidisciplinary survey”. Data Mining and Knowledge Discovery.

Ben-Gal, I.; Dana, A.; Shkolnik, N.; Singer (2014). “Efficient Construction of Decision Trees by the Dual Information Distance Method” (PDF). Quality Technology & Quantitative Management (QTQM). 11 (1): 133–147.

“Principles of Data Mining”. 2007. doi:10.1007/978-1-84628-766-4. ISBN 978-1-84628-765-7.

Deng, H.; Runger, G.; Tuv, E. (2011). “Bias of importance measures for multi-valued attributes and solutions”. Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN). pp. 293–300.

Brandmaier, Andreas M.; Oertzen, Timo von; McArdle, John J.; Lindenberger, Ulman. “Structural equation model trees”. Psychological Methods. 18 (1): 71–86. doi:10.1037/a0030001. PMC 4386908.

Painsky, Amichai; Rosset, Saharon (2017). “Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance”. IEEE Transactions on Pattern Analysis and Machine Intelligence. 39 (11): 2142–2153.

http://citeseer.ist.psu.edu/oliver93decision.html

Tan & Dowe (2003)

Papagelis, A.; Kalles, D. (2001). “Breeding Decision Trees Using Evolutionary Techniques” (PDF). Proceedings of the Eighteenth International Conference on Machine Learning, June 28–July 1, 2001. pp. 393–400.

Barros, Rodrigo C.; Basgalupp, M. P.; Carvalho, A. C. P. L. F.; Freitas, Alex A. (2012). “A Survey of Evolutionary Algorithms for Decision-Tree Induction”. IEEE Transactions on Systems, Man and Cybernetics. Part C: Applications and Reviews. 42 (3): 291–312. doi:10.1109/TSMCC.2011.2157494.

Chipman, Hugh A.; George, Edward I.; McCulloch, Robert E. (1998). “Bayesian CART model search”. Journal of the American Statistical Association. 93 (443): 935–948. doi:10.1080/01621459.1998.10473750.

Barros, R. C.; Cerri, R.; Jaskowiak, P. A.; Carvalho, A. C. P. L. F. (2011). “A bottom-up oblique decision tree induction algorithm”. Proceedings of the 11th International Conference on Intelligent Systems Design and Applications (ISDA 2011). doi:10.1109/ISDA.2011.6121697
