By Alfred J Nigl, Ph.D. and Dean Grey
Although the future of using AI and Machine Learning in social media remains in doubt due to the global data privacy legislation, AI and Machine Learning are still being applied by social media businesses to help them understand, segment and organize the vast amounts of data that are being collected daily.
Narayan (2018) detailed six areas where AI (for purposes of simplicity, AI will be used in this paper to represent the combined application of AI with machine learning) is impacting social media marketing.
 Content Creation: In order to keep up with the heavy demand for content to fuel very intense rates of marketing campaigns (the average brand is launching six campaigns a month), brand marketing professionals are using AI to discover user-generated content that can be incorporated into a marketing campaign. Twizoo is an example of a company which uses AI to automatically generate marketing campaign content, including user-generated videos.
 Consumer Intelligence: AI methods can uncover hidden gems in user behavior, especially leveraging the value in unstructured data and posting and comment trends and converting these consumer behaviors into actionable marketing tools to drive campaigns. Conversion is a digital consulting company that specializes in gathering market intelligence from social and customer sentiment (“voice of customer data”).
 Customer Service: Chatbots are AI tools that more and more companies are using to replace human operators in live chat situations. Skylab USA recently upgraded its community platform chat system by partnering with Applozic which has the capability of integrating with chatbots in the future.
 Influencer marketing: In his recent article, Narayan is quoted as saying, “It’s imperative that brands have more intelligence into how they associate with certain influencers….” The use of AI methods will make the matches between brands and influencers stronger than matches made by humans alone. InsightPool is a new platform that searches through 600 million influencers who have been identified across various social media platforms to find the best matches for a brand’s unique characteristics, personality and goals. Skylab is also researching brand affinities among its platform user base; however, its focus is on micro-influencers, not high-volume influencers. Skylab has recently completed a series of surveys which investigated the extent to which one planet (i.e., a separate Skylab business unit) had a sufficient number of micro-influencers among its user base and which brands they felt an affinity toward.
 Content Optimization: Many companies have leveraged the market intelligence that AI methods can produce to uncover, based on customer blogs, topics and brand affinities that are of very high interest among large numbers of consumers. By compiling this information, social media platforms can create high-value content that will attract millions of users. Because Skylab is driven more by user behavior than by content, this is not a high-priority focus for Skylab research at this time.
 Competitive Intelligence: There are at least two areas where AI methods can help companies identify competitor threats and advances in order to maintain their edge and/or market share; these include using NLP to determine the meaning behind clusters of words and statistical AI to disambiguate data to create actionable insights and also to identify outliers. An important part of this process is pattern matching to differentiate important consumer social media posts and content from the trivial.
Machine Learning to help manage the 3 V’s of Big Data:
The three Vs of Big Data: Volume, Velocity, and Variety have overwhelmed traditional methods of data management and analysis. The rapid growth of Big Data has been associated with the adoption of AI and Machine Learning methods for analyzing the tremendous volume and complexity of data.
In fact, based on a recent blog post (snap.stanford.edu, 2018), every second of every day 3.3 million new posts appear on Facebook and about half a million on Twitter. To help keep track of such a high volume of data, analysts have developed highly sophisticated analytical and modeling methods using open source tools such as R and Python. The algorithms created enable computers to identify patterns and classify them into clusters. Such algorithms are well suited to providing usable information derived from unstructured data and a mix of video, other media, text messages, posts and poster preferences, with a higher degree of accuracy than is possible for a human analyst or marketing specialist.
Scraping and Data Lakes
In order to intelligently process the overwhelming volume and variety of social media activity, web-scraping tools have been created to gather all the posts associated with a particular brand and store them in a data lake, from which they can be fed into algorithms to be segmented and clustered into relevant bits of information that can then be used to develop marketing messages and drive new marketing campaigns. A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse (i.e., the traditional Oracle-type relational DB) stores data in files or folders, a data lake uses a flat architecture to store data.
Skylab USA’s Approach to Analyzing Big Data – Tableau and Redshift
In order to more efficiently process, scale and analyze the tremendous amount of data generated by its many unique business units (in the Skylab vernacular, called planets), Skylab has moved its data to the cloud (AWS) and is using APIs to connect key data points to Redshift. Skylab’s main analysis and visualization tool is Tableau, which was chosen because of its robust analytics engine.
The summary document below shows the variety and complexity of the Skylab platform user data which is currently being processed with the Redshift-Tableau integration.

 User Engagement
 DAU and MAU totals
 DAU % and MAU % are compared to the total active user base, defined as Users who have been active on the platform in the last 180 days (this time frame needs to be adjustable)
 DAU over MAU % is a measure of “stickiness”
 Rank relative to social media
 Rank relative to Skylab’s clients
 Brand Engagement % (this really is a stat we compare to others)
 Rank relative to other platforms
 Rank in Skylab’s world
 Sustained Engagement / Retention %
 Based off Wilson’s Law ( the 30/10/10 rule and hopefully we can find more )
 VRS Index ( you becoming a better you ) with weights for calculating an Index Score for each Planet
 Actions 20%
 community focused 33% of the 20%
 Personal Growth 33% of the 20%
 Helping others – 33% of the 20%
 Social Engagement 20%
 Post 20%
 Lesson 20%
 Chat 20%
 Retention measurement using the standard formula: of the total users who launched the app for the first time, the share who went back on the app at the following intervals, counted from the day they first downloaded the app: the next day (Day 1), 3 days after, and 7 days after
 Track the average amount of time users spend on each of the planet apps on a daily and monthly basis
 Reporting and exporting capabilities
 All Data must be able to be presented in a dashboard and exported
 Business uses
 in a report for the CMO
 For Sponsors
 For Data Science
 for Widgets on our apps
 Leveraging a 3rdparty solution for quick implementation
Community Stats module to provide data to be displayed/reported in different ways


 On an analytics dashboard, for Admins to see on the web
 On widgets, on the home screen of mobile and web apps for End Users to see
 On emails being sent out to End Users and Admins, at each level of the hierarchy (i.e. Experience Admin report, Planet Admin report, etc.)
 As csv exportable files
 Reports for sponsors
 Feeding external analytics systems like Google Analytics or Firebase Analytics
 Providing an API for integrations, allowing other systems to gather data from our stats module
 The module must allow historic reporting
 e.g. how many total users at a time period. today/ last week/ last month (to track growth)

 «Community Stats» Features general logic
 Widgets must work to track/gamify the Experiences as well
 On any Planet, Planet Admins must be able to view the following Widgets displaying Experiences instead of Users:
 Recently Active Widget
 Community Recognition Widget (a.k.a Leaderboard)
 On Skylab Planet only
 Community Impact Widget must have the ability to display aggregated stats for the Universe
 Planet Admin must be able to set the following Widgets to display Planets instead of Users
 Recently Active Widget
 Community Recognition Widget (a.k.a. Leaderboard)
 Stats Contests
 3. «Backbone Data warehouse» Feature Requirements
 This is the foundational infrastructure and architecture that must be in place since the beginning, to enable the full business vision and features roadmap to be materialized over time
 Time Estimate for completion ( 23 days )
 Complete Data Points definitions
 Data aggregation rules / consolidation and sorting logic (Experience > Planet > Universe)
 Data transformation from RDS via Kafka into Redshift
 Test Redshift/Kafka productivity and performance
 Verify connection, Kafka and Redshift configurations on Security and Encryption
 Setup Monitoring Environment and integrate it with existing New Relic solution
 Data Reporting Mechanism
 Dashboard (Tableau.com)
 Automated Statistical Reporting (Monthly, Weekly, On-demand)
 Data export
 Data archiving and deletion mechanism
 Backbone V2
 Integration of https://segment.com/mobile with mobile (iOS, Android)
 Redshift Integration with Segment.io
KPI/Metrics Definition & Logic
Daily Active Users (DAU)
 For the sake of consistency, DAU should be measured using the same criterion as we use for measuring MAU: either opening up the app, or opening up the app and taking at least one action (we need to decide which)
 Note: Facebook’s definition of DAU is calculated by aggregating the number of users who open up the app in a 24-hour period regardless of what else they do, including nothing.
 Minimum
 Anytime frontend sends the request to get info from the backend
 e.g. anytime scrolling up/down through the Home Screen
 Mechanism to set user inactive/passive/asleep/suspend
 Boris Nayflish add the details in here
 Mechanism to re-engage without forcing a user to log back in
 DAU should be measured on a 24-hour basis from the 1st day of each new month to the last day, for each individual planet and for Skylab as a whole; the average DAU for a month can then be calculated and that result divided by the total users to get a DAU%
Monthly Active Users (MAU)
 Any User who has been a DAU at least once in the last 30 days
DAU/MAU ratio
 What is the DAU/MAU Ratio?
 The Daily Active Users (DAU) to Monthly Active Users (MAU) Ratio measures the stickiness of your product – that is, how often people engage with your product. DAU is the number of unique users who engage with your product in a one-day window. MAU is the number of unique users who engage with your product over a 30-day window (usually a rolling 30 days).
 The ratio of DAU to MAU is the proportion of monthly active users who engage with your product in a single day window.
 Advice from VCs: Why DAU/MAU Ratio is critical
 “If there’s one number every founder should always know, it’s the company’s growth rate. That’s the measure of a startup. If you don’t know that number, you don’t even know if you’re doing well or badly. The best thing to measure the growth rate of is revenue. The next best, for startups that aren’t charging initially, is active users. That’s a reasonable proxy for revenue growth because whenever the startup does start trying to make money, their revenues will probably be a constant multiple of active users.” – Paul Graham, VC and Co-Founder of Y Combinator
 “The metrics we start with are total active users (monthly/weekly/daily) and its growth, alongside any ratios like DAU/MAU or DAU/WAU. These help us understand how frequently active people are in using the products.” – Josh Elman, Partner at Greylock Partners
 “I would argue that the single most telling metric for a great product is how many of them become dedicated, repeat users.” – Andrew Chen, Angel Investor
 How to calculate DAU/MAU Ratio:
 (#) Daily active users / (#) Monthly active users = (%) DAU/MAU Ratio
 The key to calculating DAU/MAU Ratio is defining what ‘active’ is for your product. This could be anything from a purchase (for ecommerce or mobile apps), pages viewed/videos watched/comments (for media/publisher), or product login/usage (for SaaS companies or mobile apps).
 Once you’ve defined ‘active’ for your product, determine the number of unique active users in a 24-hour period and also the number of unique active users over the past 30 days (usually based on a rolling 30 days). With these two metrics, you can divide DAU by MAU to get the ratio percentage.
 A variation of this metric is to swap MAU with the total number of unique weekly active users (WAU). This gives you the DAU/WAU Ratio.
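The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "active" is taken to mean any logged event, the activity log is a list of (user_id, date) pairs, and the function names and sample data are hypothetical rather than part of the Skylab platform.

```python
from datetime import date, timedelta

def dau(events, day):
    """Unique users with at least one event on `day`."""
    return len({user for user, d in events if d == day})

def mau(events, day, window_days=30):
    """Unique users active in the rolling window ending on `day` (inclusive)."""
    start = day - timedelta(days=window_days - 1)
    return len({user for user, d in events if start <= d <= day})

def dau_mau_ratio(events, day, window_days=30):
    """DAU / MAU as a percentage; 0.0 when nobody was active in the window."""
    monthly = mau(events, day, window_days)
    return 100.0 * dau(events, day) / monthly if monthly else 0.0

# Hypothetical activity log: (user_id, activity_date)
events = [
    ("u1", date(2018, 6, 30)),
    ("u2", date(2018, 6, 30)),
    ("u2", date(2018, 6, 30)),  # repeat visits on the same day count once
    ("u3", date(2018, 6, 15)),
    ("u4", date(2018, 6, 2)),
    ("u5", date(2018, 5, 20)),  # falls outside the 30-day window
]
today = date(2018, 6, 30)
print(dau(events, today), mau(events, today), dau_mau_ratio(events, today))
```

Passing `window_days=7` gives the DAU/WAU variation mentioned above.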
Super Consistent Users (SCU), Super Engaged (SEU) and Super Loyal (SLU)
 CONSISTENCY – SCU is a measure of User’s consistency (i.e. current streaks)
 3 days
 7 days
 30 days
 90 days
 6 months
 1yr
 2 Yrs
 Filters by the % of the total group per category.
 Filter by the total number of users to qualify for each category
 ENGAGEMENT – SEU is a measure of User’s engagement score points earned per time frame
 FREQUENCY – SFU is a measure of User’s number of sessions per day
 2x
 3x
 4x
 5x
 …
 Filter by % and total users in the community
 ATTENTION – SAU (Super Attentive User) is a measure of User’s time per session
 RETENTION/LOYALTY – SLU is a measure of Users with high retention rates (i.e., coming back to the app over time)
 Measured by the following time periods:
 3 days
 7 days
 30 days
 60 days
 90 days
 6 months
 1yr
 2yr
 The period of time to calculate SEU can be variable (e.g. of the week, of the month)
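As a rough illustration of the consistency (SCU) measure, the sketch below counts a user's current streak of consecutive active days and maps it to the streak categories listed above. The day counts used for "6 months", "1 yr" and "2 yrs" (180, 365 and 730) are assumptions, as are the function names and the sample data.

```python
from datetime import date, timedelta

# Streak tiers in days; 180/365/730 approximate "6 months", "1 yr", "2 yrs" (assumption)
TIERS = [3, 7, 30, 90, 180, 365, 730]

def current_streak(active_days, today):
    """Count consecutive active days ending on `today`."""
    days = set(active_days)
    streak = 0
    while today - timedelta(days=streak) in days:
        streak += 1
    return streak

def streak_tier(streak):
    """Largest tier the streak has reached, or None if below the first tier."""
    reached = [t for t in TIERS if streak >= t]
    return reached[-1] if reached else None

# Hypothetical user: active today and the 3 days before, then a gap
today = date(2018, 7, 10)
active = [today - timedelta(days=n) for n in (0, 1, 2, 3, 5, 6)]
streak = current_streak(active, today)
print(streak, streak_tier(streak))
```

Counting how many users fall into each tier, and dividing by the community size, would give the "% of the total group per category" filter.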
Post-Tableau Predictive Analytics Process
After the data is processed through Tableau and basic visualization reports have been created, Skylab’s data science team uses the intelligence created by Tableau to build predictive models using such tools as KNIME, R, and Python.
The types of algorithms that can be applied to Skylab user data include the following:
 Decision tree
 Random forest
 Logistic regression
 Support vector machine
 Naive Bayes
Definitions and all illustrations below are based on information from a blog post on bigdatamadesimple.com (http://bigdatamadesimple.com/10machinelearningalgorithmsknow2018/). There are two main categories of predictive models: regression models and classification models.
According to Dr. Jason Brownlee, predictive modeling is the problem of developing a model using historical data to make a prediction on new data where we do not have the answer.
Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output variables (y). This is called the problem of function approximation. The job of the modeling algorithm is to find the best mapping function we can given the time and resources available.
Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).
The output variables are often called labels or categories. The mapping function predicts the class or category for a given observation.
For example, an email or text can be classified as belonging to one of two classes: “spam“ and “not spam“.
 A classification problem requires that examples be classified into one of two or more classes.
 A classification can have real-valued or discrete input variables.
 A problem with two classes is often called a two-class or binary classification problem.
 A problem with more than two classes is often called a multiclass classification problem.
 A problem where an example is assigned multiple classes is called a multilabel classification problem.
It is common for classification models to predict a continuous value as the probability of a given example belonging to each output class. The probabilities can be interpreted as the likelihood or confidence of a given example belonging to each class. A predicted probability can be converted into a class value by selecting the class label that has the highest probability.
For example, a specific email or text may be assigned the probabilities of 0.1 as being “spam” and 0.9 as being “not spam”. We can convert these probabilities to a class label by selecting the “not spam” label as it has the highest predicted likelihood.
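A minimal Python sketch of this conversion, using the probabilities from the example (the function name is hypothetical):

```python
def predict_label(probabilities):
    """Return the class label with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)

probs = {"spam": 0.1, "not spam": 0.9}
label = predict_label(probs)
print(label)
```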
There are many ways to estimate the skill of a classification predictive model, but perhaps the most common is to calculate the classification accuracy.
The classification accuracy is the percentage of correctly classified examples out of all predictions made.
For example, if a classification predictive model made 5 predictions and 3 of them were correct and 2 of them were incorrect, then the classification accuracy of the model based on just these predictions would be:
accuracy = correct predictions / total predictions * 100
accuracy = 3 / 5 * 100
accuracy = 60%
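In Python, the accuracy for this 5-prediction example can be sketched as follows. The label lists are hypothetical and chosen so that 3 of the 5 predictions are correct.

```python
def classification_accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

# Hypothetical labels: the first three predictions are right, the last two wrong
y_true = ["spam", "not spam", "spam", "not spam", "spam"]
y_pred = ["spam", "not spam", "spam", "spam", "not spam"]
accuracy = classification_accuracy(y_true, y_pred)
print(accuracy)
```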
Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).
A continuous output variable is a real value, such as an integer or floating point value. These are often quantities, such as amounts and sizes.
For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000.
 A regression problem requires the prediction of a quantity.
 A regression can have real-valued or discrete input variables.
 A problem with multiple input variables is often called a multivariate regression problem.
 A regression problem where input variables are ordered by time is called a time series forecasting problem.
Because a regression predictive model predicts a quantity, the skill of the model must be reported as an error in those predictions.
There are many ways to estimate the skill of a regression predictive model, but perhaps the most common is to calculate the root mean squared error, abbreviated by the acronym RMSE.
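For example, if a regression predictive model made 2 predictions, one of 1.5 where the expected value is 1.0 and another of 3.3 where the expected value is 3.0, then the RMSE would be:
RMSE = sqrt(average(error^2))
RMSE = sqrt(((1.0 - 1.5)^2 + (3.0 - 3.3)^2) / 2)
RMSE = sqrt((0.25 + 0.09) / 2)
RMSE = sqrt(0.17)
RMSE = 0.412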
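The RMSE for these two predictions can also be checked with a short Python sketch (the function name is hypothetical):

```python
import math

def rmse(expected, predicted):
    """Root mean squared error between expected and predicted values."""
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

value = rmse([1.0, 3.0], [1.5, 3.3])
print(round(value, 3))
```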
Classification predictive modeling problems are different from regression predictive modeling problems.
 Classification is the task of predicting a discrete class label.
 Regression is the task of predicting a continuous quantity.
However as Brownlee points out, there is some overlap between the algorithms for classification and regression; for example:
 A classification algorithm may predict a continuous value, but the continuous value is in the form of a probability for a class label.
 A regression algorithm may predict a discrete value, but the discrete value is in the form of an integer quantity.
Some algorithms can be used for both classification and regression with small modifications, such as decision trees and artificial neural networks. Some algorithms cannot, or cannot easily be used for both problem types, such as linear regression for regression predictive modeling and logistic regression for classification predictive modeling.
Brownlee (2018) also points out that the way that data scientists evaluate classification and regression predictions varies and does not overlap, for example:

 Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.
 Regression predictions can be evaluated using root mean squared error, whereas classification predictions cannot.
Types of Machine Learning Algorithms
In a post on February 6, 2017, Rahul Saxena presented an explanation of Bayesian machine learning.
The Naive Bayes classifier is a straightforward and powerful algorithm for classification tasks. Even if we are working on a data set with millions of records and some attributes, it is worth trying a Naive Bayes approach.
The Naive Bayes classifier gives great results when we use it for textual data analysis, such as Natural Language Processing (NLP).
To understand the Naive Bayes classifier, we first need to understand Bayes’ theorem.
Bayes’ theorem is named after Rev. Thomas Bayes. It works on conditional probability. Conditional probability is the probability that something will happen, given that something else has already occurred. Using conditional probability, we can calculate the probability of an event using prior knowledge.
Below is the formula for calculating the conditional probability:
P(H | E) = P(E | H) * P(H) / P(E)
 P(H) is the probability of hypothesis H being true. This is known as the prior probability.
 P(E) is the probability of the evidence (regardless of the hypothesis).
 P(E | H) is the probability of the evidence given that the hypothesis is true.
 P(H | E) is the probability of the hypothesis given that the evidence is there.
Let’s consider an example from Skylab USA to understand how the above formula of Bayes theorem works.
Problem:
Imagine that there are two types of users identified on a particular planet (the term used to designate unique business units on Skylab USA): those who have tried to influence others to purchase a brand “D”, with two results, “Positive” and “Negative.” A test is developed to measure various influencer traits, and the test is determined to correctly identify Positive Influencers with 99% accuracy: if you have the positive Brand D traits, you will test positive 99% of the time; if you don’t have these traits, you will test negative 99% of the time. If only 3% of all people have these traits and the test gives a particular user a “positive” result, what is the probability that that user actually is a positive Brand D influencer?
For solving the above problem, we will have to use conditional probability.
The probability that a person positively influences others to buy Brand D: P(D) = 0.03 = 3%
The probability that the test gives a “positive” result given that the person is a positive influencer: P(Pos | D) = 0.99 = 99%
The probability that a person is not a positive influencer of D: P(~D) = 0.97 = 97%
The probability that the test gives a “positive” result given that the person is not a positive influencer: P(Pos | ~D) = 0.01 = 1%
To calculate the probability that the person is a positive influencer, i.e., P(D | Pos), we will use Bayes’ theorem:
P(D | Pos) = P(Pos | D) * P(D) / P(Pos)
We have all the values in the numerator, but we need to calculate P(Pos):
P(Pos) = P(D, Pos) + P(~D, Pos)
= P(Pos | D) * P(D) + P(Pos | ~D) * P(~D)
= 0.99 * 0.03 + 0.01 * 0.97
= 0.0297 + 0.0097
= 0.0394
Let’s calculate P(D | Pos) = (P(Pos | D) * P(D)) / P(Pos)
= (0.99 * 0.03) / 0.0394
= 0.753807107
So, there is an approximately 75% chance that the person is actually a positive Brand D influencer.
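The same arithmetic can be reproduced in Python. The function and parameter names below are hypothetical; the inputs are the figures from the Brand D example.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(D | Pos) via Bayes' theorem."""
    # P(Pos) = P(Pos | D) * P(D) + P(Pos | ~D) * P(~D)
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence

p = posterior(prior=0.03, true_positive_rate=0.99, false_positive_rate=0.01)
print(round(p, 4))
```

Even with a 99% accurate test, the low prior (3%) keeps the posterior near 75%, because false positives from the much larger non-influencer group are not negligible.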
Naive Bayes Classifier
Naive Bayes is a kind of classifier which uses Bayes’ theorem. It predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class. The class with the highest probability is considered the most likely class. This is also known as Maximum A Posteriori (MAP).
The MAP for a hypothesis is:
MAP(H)
= max( P(H | E) )
= max( (P(E | H) * P(H)) / P(E) )
= max( P(E | H) * P(H) )
P(E) is the evidence probability, and it is used to normalize the result. It is the same for every hypothesis, so removing it won’t affect which hypothesis has the maximum.
The Naive Bayes classifier assumes that all the features are independent of each other given the class: the presence or absence of a feature does not influence the presence or absence of any other feature. We can use the Wikipedia example to explain the logic:
A fruit may be considered to be an apple if it is red, round, and about 4″ in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
In real datasets, we test a hypothesis given multiple pieces of evidence (features), so calculations become complicated. To simplify the work, the feature independence approach is used to ‘uncouple’ the multiple pieces of evidence and treat each as an independent one.
P(H | Multiple Evidences) = P(E1 | H) * P(E2 | H) * … * P(En | H) * P(H) / P(Multiple Evidences)
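The product rule above, combined with the MAP decision, can be sketched as a small from-scratch classifier. This is an illustrative sketch and not a production implementation: the toy fruit data is hypothetical, P(Multiple Evidences) is dropped because it is the same for every class, and add-one smoothing (assuming two possible values per feature) is included so that an unseen value does not zero out the product.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def predict(row, class_counts, feature_counts):
    """Return the MAP class and the unnormalized scores P(H) * prod P(Ei | H)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(H)
        for i, value in enumerate(row):
            # Add-one smoothing; the "+ 2" assumes two possible values per feature
            score *= (feature_counts[(label, i)][value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: (color, shape) -> fruit
rows = [("red", "round"), ("red", "round"), ("yellow", "long"),
        ("yellow", "long"), ("red", "long"), ("yellow", "round")]
labels = ["apple", "apple", "banana", "banana", "banana", "apple"]
model = train(rows, labels)
label, scores = predict(("red", "round"), *model)
print(label, scores)
```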
Example of Naive Bayes Classifier
For understanding a theoretical concept, the best procedure is to try it on an example. Since I am a pet lover, I selected animals as the predicted class.
Types of Naive Bayes Algorithm
Gaussian Naive Bayes
When attribute values are continuous, an assumption is made that the values associated with each class are distributed according to Gaussian i.e., Normal Distribution.
Suppose an attribute “x” in our data contains continuous values. We first segment the data by class and then compute the mean and variance of x within each class.
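A minimal Python sketch of this step: the per-class mean and variance are estimated, and the Gaussian (normal) density is then used as the likelihood of a new value under each class. The data (minutes per session for two hypothetical user classes) and the function names are assumptions, and the population variance is used here for simplicity.

```python
import math
from statistics import mean, pvariance

def gaussian_pdf(x, mu, var):
    """Normal density with mean `mu` and variance `var`, evaluated at `x`."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(values, labels):
    """Segment the values by class, then compute (mean, variance) per class."""
    by_class = {}
    for v, y in zip(values, labels):
        by_class.setdefault(y, []).append(v)
    return {y: (mean(vs), pvariance(vs)) for y, vs in by_class.items()}

# Hypothetical data: minutes per session for two user classes
values = [2.0, 3.0, 4.0, 10.0, 12.0, 14.0]
labels = ["casual", "casual", "casual", "engaged", "engaged", "engaged"]
params = fit(values, labels)

# Likelihood of a 5-minute session under each class
likelihoods = {y: gaussian_pdf(5.0, mu, var) for y, (mu, var) in params.items()}
best = max(likelihoods, key=likelihoods.get)
print(params, best)
```

Because both classes are equally frequent in this toy data, comparing the likelihoods alone gives the same answer as comparing likelihood times prior.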
Multinomial Naive Bayes
Multinomial Naive Bayes is preferred for data that is multinomially distributed. It is one of the standard classic algorithms used in text categorization (classification). Each event in text classification represents the occurrence of a word in a document.
Bernoulli Naive Bayes
Bernoulli Naive Bayes is used on data that is distributed according to multivariate Bernoulli distributions, i.e., there can be multiple features, but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. So, it requires features to be binary-valued.
Advantages and Disadvantages of the Naive Bayes Classifier
Advantages
 Naive Bayes Algorithm is a fast, highly scalable algorithm.
 Naive Bayes can be used for Binary and Multiclass classification. It provides different types of Naive Bayes Algorithms like GaussianNB, MultinomialNB, BernoulliNB.
 It is a simple algorithm that depends on doing a bunch of counts.
 Great choice for Text Classification problems. It’s a popular choice for spam email classification.
 It can be easily trained on a small dataset
Disadvantages
 Naive Bayes can learn the importance of individual features but can’t determine the relationships among features.
Decision Tree Algorithm
 A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
 Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but are also a popular tool in machine learning.
A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.
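The structure described above can be sketched as a nested dictionary in Python, with a function that follows one root-to-leaf path. The attributes, thresholds and class labels are hypothetical and only illustrate the idea.

```python
# Internal nodes test one attribute; leaves hold a class label.
tree = {
    "attribute": "sessions_per_day",
    "threshold": 3,
    "left": {"label": "casual"},        # sessions_per_day < 3
    "right": {                          # sessions_per_day >= 3
        "attribute": "streak_days",
        "threshold": 7,
        "left": {"label": "engaged"},
        "right": {"label": "super consistent"},
    },
}

def classify(node, record):
    """Follow one root-to-leaf path and return the leaf's class label."""
    while "label" not in node:
        branch = "left" if record[node["attribute"]] < node["threshold"] else "right"
        node = node[branch]
    return node["label"]

result = classify(tree, {"sessions_per_day": 5, "streak_days": 10})
print(result)
```

Each path through the dictionary corresponds to one classification rule, for example "sessions_per_day >= 3 and streak_days >= 7 implies super consistent".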
In decision analysis, a decision tree and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated.
A decision tree consists of three types of nodes:[1]
 Decision nodes – typically represented by squares
 Chance nodes – typically represented by circles
 End nodes – typically represented by triangles
Decision trees are commonly used in operations research and operations management. If, in practice, decisions have to be taken online with no recall under incomplete knowledge, a decision tree should be paralleled by a probability model as the best choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.
Decision trees, influence diagrams, utility functions, and other decision analysis tools and methods are taught to undergraduate students in schools of business, health economics, and public health, and are examples of operations research or management science methods.
Decision tree building block
Decision tree elements
Decision trees used in data mining are of two main types:
 Classification tree analysis is when the predicted outcome is the class to which the data belongs.
 Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient’s length of stay in a hospital).
The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures, first introduced by Breiman et al.[3] Trees used for regression and trees used for classification have some similarities – but also some differences, such as the procedure used to determine where to split.[3]
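One way to see the difference in how splits are chosen is to compare the impurity measures. The sketch below assumes the measures commonly associated with CART, Gini impurity for classification and variance (squared error) for regression. The function names are illustrative and the code is not taken from any library.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_score(left, right, impurity):
    """Size-weighted impurity of the two child nodes; lower is better."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)
```

A classification tree would compare candidate splits with `split_score(left, right, gini)`, while a regression tree would use `split_score(left, right, variance)` on the numeric targets. The search over candidate splits is the same in both cases, and only the impurity function changes.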
Some techniques, often called ensemble methods, construct more than one decision tree:
 Boosted trees: incrementally build an ensemble by training each new tree to emphasize the training instances that were previously mis-modeled. A typical example is AdaBoost. These can be used for regression-type and classification-type problems.[5][6]
 Bootstrap aggregated (or bagged) decision trees: an early ensemble method that builds multiple decision trees by repeatedly resampling the training data with replacement and letting the trees vote on a consensus prediction.[7]
 A random forest classifier is a specific type of bootstrap aggregating.
 Rotation forest – in which every decision tree is trained by first applying principal component analysis (PCA) on a random subset of the input features.[8]
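To make the bagging idea concrete, here is a minimal sketch in plain Python. It uses one-split trees ("stumps") on one-dimensional toy data, trains each on a bootstrap sample and predicts by majority vote. The helper names, the toy data, the number of trees and the seed are all assumptions for illustration. A real random forest would also sample features at each split, which this sketch does not do.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-split tree on (x, label) pairs: pick the threshold with fewest errors."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        left = [label for x, label in points if x <= threshold]
        right = [label for x, label in points if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls to the right of the threshold, reuse the left label.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(label != left_label for label in left) + sum(
            label != right_label for label in right
        )
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def bagged_stumps(points, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(points) for _ in points]
        trees.append(fit_stump(sample))

    def predict(x):
        # Majority vote across the ensemble.
        votes = Counter(tree(x) for tree in trees)
        return votes.most_common(1)[0][0]

    return predict


# Toy one-dimensional data: small x values are "low", large ones are "high".
data = [(1, "low"), (2, "low"), (3, "low"), (7, "high"), (8, "high"), (9, "high")]
model = bagged_stumps(data)
```

Individual stumps can disagree because each sees a different resample, but the vote smooths this out. On this toy data `model(1)` and `model(9)` should come out as "low" and "high" respectively.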
A special case of a decision tree is a decision list,[9] which is a one-sided decision tree so that every internal node has exactly 1 leaf node and exactly 1 internal node as a child (except for the bottommost node, whose only child is a single leaf node). While less expressive, decision lists are arguably easier to understand than general decision trees due to their added sparsity, and they permit non-greedy learning methods[10] and monotonic constraints to be imposed.[11]
Decision tree learning is the construction of a decision tree from class-labeled training tuples. A decision tree is a flowchart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node.
There are many specific decision-tree algorithms. Notable ones include:
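A decision list can be read as an ordered chain of "if ... else if ... else" rules. The sketch below shows that reading in Python. The function name and the support-message triage rules are hypothetical and only there to illustrate the shape.

```python
def make_decision_list(rules, default):
    """Build a classifier from ordered (condition, label) rules plus a default label."""
    def predict(example):
        for condition, label in rules:  # each rule plays the role of an internal node
            if condition(example):      # whose "yes" branch ends in a leaf
                return label
        return default                  # the bottommost node's single leaf
    return predict


# Hypothetical rules for triaging a support message.
triage = make_decision_list(
    rules=[
        (lambda m: m["mentions_outage"], "urgent"),
        (lambda m: m["is_paying_customer"], "priority"),
    ],
    default="normal",
)
```

Because the rules are checked strictly in order, a message that mentions an outage is labeled "urgent" regardless of the later conditions, which is what makes the list easy to read top to bottom.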
 ID3 (Iterative Dichotomiser 3)
 C4.5 (successor of ID3)
 CART (Classification And Regression Tree)
 CHAID (Chi-squared Automatic Interaction Detector). Performs multi-level splits when computing classification trees.
 Decision Stream. Uses test statistics to generate a deep directed acyclic graph of decision rules to solve classification and regression tasks.[4]
 MARS: extends decision trees to handle numerical data better.
 Conditional Inference Trees. A statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting. This approach results in unbiased predictor selection and does not require pruning.[13][14]
ID3 and CART were invented independently at around the same time (between 1970 and 1980), yet follow a similar approach for learning a decision tree from training tuples.
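As an illustration of how such an algorithm chooses the attribute to test at a node, the sketch below computes information gain, the criterion usually associated with ID3. The function names and the small table of posts are assumptions invented for the example. A full learner would apply `best_attribute` recursively to each resulting subset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction from splitting `rows` (a list of dicts) on `attribute`."""
    before = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


def best_attribute(rows, attributes, target):
    """The attribute an ID3-style learner would test at this node."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Invented training tuples: does a post get shared?
posts = [
    {"has_video": True, "weekend": True, "shared": "yes"},
    {"has_video": True, "weekend": False, "shared": "yes"},
    {"has_video": False, "weekend": True, "shared": "no"},
    {"has_video": False, "weekend": False, "shared": "no"},
]
```

In this toy table, splitting on "has_video" separates the two classes completely while "weekend" tells us nothing, so `best_attribute(posts, ["weekend", "has_video"], "shared")` picks "has_video".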
References
Rokach, Lior; Maimon, O. (2008). Data mining with decision trees: theory and applications. World Scientific Pub Co Inc. ISBN 9789812771711.
Quinlan, J. R. (1986). “Induction of Decision Trees”. Machine Learning. 1: 81–106. Kluwer Academic Publishers.
Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software. ISBN 9780412048418.
Ignatov, D.Yu.; Ignatov, A.D. (2017). “Decision Stream: Cultivating Deep Decision Trees”. IEEE ICTAI: 905–912. arXiv:1704.07657. doi:10.1109/ICTAI.2017.00140.
Friedman, J. H. (1999). Stochastic gradient boosting. Stanford University.
Hastie, T., Tibshirani, R., Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer Verlag.
Breiman, L. (1996). “Bagging Predictors”. Machine Learning. 24: 123–140.
Rodriguez, J. J.; Kuncheva, L. I.; Alonso, C. J. (2006). “Rotation forest: A new classifier ensemble method”. IEEE Transactions on Pattern Analysis and Machine Intelligence. 28 (10): 1619–1630. doi:10.1109/TPAMI.2006.211.
Rivest, Ron (Nov 1987). “Learning Decision Lists” (PDF). Machine Learning. 3 (2): 229–246. doi:10.1023/A:1022607331053.
Letham, Ben; Rudin, Cynthia; McCormick, Tyler; Madigan, David (2015). “Interpretable Classifiers Using Rules And Bayesian Analysis: Building A Better Stroke Prediction Model”. Annals of Applied Statistics. 9: 1350–1371. arXiv:1511.01644. doi:10.1214/15-AOAS848.
Wang, Fulton; Rudin, Cynthia (2015). “Falling Rule Lists” (PDF). Journal of Machine Learning Research. 38.
Kass, G. V. (1980). “An exploratory technique for investigating large quantities of categorical data”. Applied Statistics. 29 (2): 119–127. doi:10.2307/2986296. JSTOR 2986296.
Hothorn, T.; Hornik, K.; Zeileis, A. (2006). “Unbiased Recursive Partitioning: A Conditional Inference Framework”. Journal of Computational and Graphical Statistics. 15 (3): 651–674. doi:10.1198/106186006X133933. JSTOR 27594202.
Strobl, C.; Malley, J.; Tutz, G. (2009). “An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests”. Psychological Methods. 14 (4): 323–348. doi:10.1037/a0016973.
Rokach, L.; Maimon, O. (2005). “Top-down induction of decision trees classifiers - a survey”. IEEE Transactions on Systems, Man, and Cybernetics, Part C. 35 (4): 476–487. doi:10.1109/TSMCC.2004.843247.
Witten, Ian; Frank, Eibe; Hall, Mark (2011). Data Mining. Burlington, MA: Morgan Kaufmann. pp. 102–103. ISBN 9780123748560.
James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert (2015). An Introduction to Statistical Learning. New York: Springer. p. 315. ISBN 9781461471370.
Mehta, Dinesh; Raghavan, Vijay (2002). “Decision tree approximations of Boolean functions”. Theoretical Computer Science. 270 (1–2): 609–623. doi:10.1016/S0304-3975(01)00011-1.
Hyafil, Laurent; Rivest, R. L. (1976). “Constructing Optimal Binary Decision Trees is NP-complete”. Information Processing Letters. 5 (1): 15–17. doi:10.1016/0020-0190(76)90095-8.
Murthy S. (1998). Automatic construction of decision trees from data: A multidisciplinary survey. Data Mining and Knowledge Discovery
Ben-Gal, I.; Dana, A.; Shkolnik, N.; Singer (2014). “Efficient Construction of Decision Trees by the Dual Information Distance Method” (PDF). Quality Technology & Quantitative Management (QTQM). 11 (1): 133–147.
“Principles of Data Mining”. 2007. doi:10.1007/978-1-84628-766-4. ISBN 9781846287657.
Deng, H.; Runger, G.; Tuv, E. (2011). Bias of importance measures for multi-valued attributes and solutions. Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN). pp. 293–300.
Brandmaier, Andreas M.; Oertzen, Timo von; McArdle, John J.; Lindenberger, Ulman. “Structural equation model trees”. Psychological Methods. 18 (1): 71–86. doi:10.1037/a0030001. PMC 4386908.
Painsky, Amichai; Rosset, Saharon (2017). “Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance”. IEEE Transactions on Pattern Analysis and Machine Intelligence. 39 (11): 2142–2153.
http://citeseer.ist.psu.edu/oliver93decision.html
Tan & Dowe (2003)
Papagelis, A.; Kalles, D. (2001). “Breeding Decision Trees Using Evolutionary Techniques” (PDF). Proceedings of the Eighteenth International Conference on Machine Learning, June 28–July 1, 2001. pp. 393–400.
Barros, Rodrigo C.; Basgalupp, M. P.; Carvalho, A. C. P. L. F.; Freitas, Alex A. (2012). “A Survey of Evolutionary Algorithms for Decision-Tree Induction”. IEEE Transactions on Systems, Man and Cybernetics. Part C: Applications and Reviews. 42 (3): 291–312. doi:10.1109/TSMCC.2011.2157494.
Chipman, Hugh A.; George, Edward I.; McCulloch, Robert E. (1998). “Bayesian CART model search”. Journal of the American Statistical Association. 93 (443): 935–948. doi:10.1080/01621459.1998.10473750.
Barros, R. C.; Cerri, R.; Jaskowiak, P. A.; Carvalho, A. C. P. L. F. (2011). “A bottom-up oblique decision tree induction algorithm”. Proceedings of the 11th International Conference on Intelligent Systems Design and Applications (ISDA 2011). doi:10.1109/ISDA.2011.6121697.