CRISP_DM and Skylab USA
Alfred J. Nigl, Ph.D. and Dean Grey
Skylab USA, one of the world’s most robust white-labeled social media engagement platforms, was created to leverage the principles of the science of engagement and collects a wide variety of user data and information. Skylab’s data science department has adopted the CRISP_DM method for organizing and compiling data for analysis and reporting. Based on this ongoing data analysis, the following types of reports can be created and distributed to industry professionals and media outlets.
|Social Media Platform||Total Monthly Active Users||Brand Engagement %||Valuation|
- Compared to social media platforms with verifiable brand engagement stats, Skylab USA is outperforming Facebook by a factor of 730 and Instagram by a factor of 16.
- These very high brand engagement levels can be explained by Skylab’s adoption of a Value Reinforcement System (VRS), based on a modern adaptation of Social Cognitive Learning Theory (see the paper by Nigl and Grey published on ResearchGate, February 2018).
Skylab is also outperforming most of the apps which have been released in both Apple and Android (Google) stores. The table below shows how Skylab ranks in total downloads as of March 2018.
With a total of 27,989 downloads, Skylab USA is outperforming over 85% of all apps released to date.
In summary, Skylab’s unique VRS and its highly intelligent gamification platform harness the power of social- and self-reinforcement systems to produce very high levels of user engagement and app downloads, which place Skylab in the upper decile for engagement, downloads, and retention.
Data Mining is a critical function that all businesses need to engage in to find and leverage the value of their legacy and current customer data. This function used to be known as Knowledge Discovery and that term is still a good explanatory description for what takes place in data mining.
In 1996, a group of leading data scientists came together for the express purpose of creating a uniform process for conducting data mining, independent of the software used and the level of experience of the user. It was designed to be freely available, and a recent survey found that over 40% of all data scientists around the world still rely on this process today. The process was named CRISP_DM, or Cross Industry Standard Process for Data Mining.
The motivation to create a standardized process included the concern among many data scientists that specialists were dominating the field, along with their belief in the democratization of data mining. If data mining were to spread among business professionals with no formal data science training, it was considered important to ensure that the process be reliable and repeatable by people with little data mining background.
Secondarily, CRISP_DM also serves as a substitute for the Experimental Method which has its beginnings in traditional physical and social sciences but typically is not formally applied to Data Science.
The graphic below is a representation of the six steps that characterize the CRISP_DM process. The key factors of this model include:
- Process Model which focuses on Business needs
- Can be applied by non-data scientists
- Provides a complete blueprint
- Data Mining methodology
- Life cycle: 6 phases
The important thing to notice about the graphic above is that the process was designed to focus on the business, not on technology or data science. Business Understanding is the first step in the process. In fact, many data scientists and other thought leaders in the field have emphasized that any predictive model developed without a strong understanding of the business is useless and not worth deploying.
The figure above also shows that the CRISP_DM process is not unidirectional. Information flows from Business Understanding to Data Understanding, but what is learned in Data Understanding can alter one’s perception of the business, and any model created must be evaluated against the Business Understanding. The outer directional arrows form a complete circle, showing that the process can involve many iterations or cycles until an effective predictive model is created and deployed.
Skylab follows the CRISP_DM method for data mining and to help organize and prepare its data for analysis and reporting as shown below.
- Skylab Data Mining Outline
Skylab comparison and engagement data is collected and compiled using the CRISP_DM process.
- Business Rule 1: All stats and data analyses must be related to the entire Skylab hierarchy
- Solar system (Skylab’s name for resellers)
- Planet (Skylab Customers)
- Experience (each planet may have multiple experiences that users can select)
- End user
- Business Rule 2: All data must be presentable in a dashboard and exportable
- For reports to the CMO and Sales team
- For sponsors/investors
- For Data Science analysis and reporting
- For creating Widgets on our apps
- Business Rule 3: Reports should reflect how Skylab clients rank against other social media platforms across the following data dimensions:
- Total App Downloads
- User Engagement
- DAU and MAU totals
- DAU % and MAU % relative to the total active user base, defined as users who have been active on the platform in the last 6 months (this time frame may be adjusted)
- DAU over MAU
- SAU (Super Active Users), a new metric Skylab is tracking, measures users who exhibit above-average activity over a 30-day period
- SEU (Super Engaged Users), another new metric Skylab has developed, measures users with extraordinarily long “streaks” of consecutive days on the app (mobile or web), something that is not covered specifically by the DAU/MAU statistics
- Rank relative to social media platforms
- Rank among Skylab’s clients
- Brand Engagement % (how Skylab planets compare to other social media platforms)
- Rank relative to other platforms
- Rank within Skylab’s world
- Sustained Engagement / Retention %
- Based on Wilson’s Law (the 30-10-10 rule)
- VRS Index (how users can become better persons)
- Community focused (helping the planet grow and prosper)
- Personal Growth (helping the person grow and prosper)
- Helping others (helping other users grow and prosper)
- Social Engagement behaviors
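As an illustration of the engagement dimensions listed under Business Rule 3, the following minimal Python sketch computes a DAU over MAU ratio and a user's best streak of consecutive active days. The event format (a list of user/day pairs), the 30-day MAU window and all names are assumptions made for this example, not a description of Skylab's actual implementation.

```python
from datetime import date, timedelta

def active_users(events, start, end):
    """Return the set of user ids with at least one event in [start, end]."""
    return {user for user, day in events if start <= day <= end}

def dau_over_mau(events, today, window_days=30):
    """DAU/MAU ratio for `today`, with MAU counted over a trailing window."""
    dau = active_users(events, today, today)
    mau = active_users(events, today - timedelta(days=window_days - 1), today)
    return len(dau) / len(mau) if mau else 0.0

def best_streak(days):
    """Longest run of consecutive calendar days in an iterable of dates."""
    best = current = 0
    previous = None
    for day in sorted(set(days)):
        if previous is not None and day - previous == timedelta(days=1):
            current += 1
        else:
            current = 1
        best = max(best, current)
        previous = day
    return best

# Hypothetical activity log: (user_id, day on which the user was active).
events = [
    ("ana", date(2018, 3, 1)), ("ana", date(2018, 3, 2)), ("ana", date(2018, 3, 3)),
    ("ben", date(2018, 3, 1)), ("ben", date(2018, 3, 3)),
    ("cy", date(2018, 2, 20)),
]
today = date(2018, 3, 3)
ratio = dau_over_mau(events, today)
ana_streak = best_streak(day for user, day in events if user == "ana")
ben_streak = best_streak(day for user, day in events if user == "ben")
print(f"DAU/MAU: {ratio:.2f}, best streaks: ana={ana_streak}, ben={ben_streak}")
```

A streak-based metric such as SEU could then be derived by comparing each user's best or current streak against a chosen cutoff; the cutoff itself is not specified here.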
Phase 1. Business Understanding
- Statement of Business Objective
- Statement of Data Mining Objective
- Statement of Success Criteria
The first part of this process is focused on understanding the project objectives and requirements from a business perspective. Once this is accomplished, the data scientist and team transform this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.
Determine business objectives
- thoroughly understand, from a business perspective, what the client really wants to accomplish; this may entail interviewing the person or team in charge of the project to gain a complete understanding of the specific goals
- during the interview process, identify early on the important factors that could influence the outcome of the project
Experienced data scientists also engage in additional fact-finding about all of the factors that should be considered and flesh out the key details as well as the key performance indicators (KPIs).
Determine data mining goals
- a business goal states objectives in business terminology
- a data mining goal states project objectives in technical terms
Example: the business goal might be “Increase sales to existing customers.”
The corresponding data mining goal might be “Predict how many products a specific customer will buy, given their purchases over the past 12-36 months, demographic information (gender, age, salary, geo-location) and the price of the item.”
Produce project plan
- describe the intended plan for achieving the data mining goals and the business goals
- the plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques
Phase 2. Data Understanding
- Explore the Data
- Verify the Quality
- Find Outliers
This phase starts with the initial data collection and other activities necessary for the data science team to become familiar with the data. Steps need to be taken to identify data quality problems and missing values, and to conduct preliminary analyses, such as creating histograms, scatter plots and descriptive statistics, in order to discover preliminary insights into the data or to detect interesting subsets that suggest hypotheses about hidden information.
Collect initial data
- work with client IT to provide the data listed in the project resources created in Phase I
- data loading, data cleaning and identification of all variables necessary for data understanding and analysis
- this step could lead to initial data preparation steps (Phase III)
- if the data is spread across multiple data sources, data integration is an additional issue, either here or in the later data preparation phase
- examine the basic properties of the acquired data
- report on the results
Once the initial data exploration and descriptive analyses have been completed, the core data mining questions can be developed and formally addressed using querying, visualization and reporting, including:
- summarizing the distribution of key attributes and simple aggregations
- identifying the relations between pairs or small numbers of attributes
- detailing the properties of significant sub-populations and running simple statistical analyses
- these steps may address directly the data mining goals
- the outcome of these processes may contribute to or refine the data description and quality reports
- they also may feed into the transformation and imputation of data and other data preparation processes needed
Verify data quality
The final step in Phase II is to examine the quality of the data, addressing questions such as:
“Is the data complete?” and “Are there missing values in the data?”
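The quality checks described in this phase (missing values, outliers, simple descriptive statistics) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the record layout, the column names and the choice of Tukey's 1.5 × IQR rule for outliers are assumptions for the example, and other outlier criteria may suit a given dataset better.

```python
import statistics

def missing_counts(rows, columns):
    """Count None values per column in a list of dict records."""
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "sessions": 12},
    {"age": None, "sessions": 9},
    {"age": 29, "sessions": 11},
    {"age": 41, "sessions": 10},
    {"age": 38, "sessions": 250},
    {"age": 45, "sessions": None},
    {"age": 31, "sessions": 13},
]
missing = missing_counts(rows, ["age", "sessions"])
sessions = [r["sessions"] for r in rows if r["sessions"] is not None]
outliers = iqr_outliers(sessions)
print("missing values per column:", missing)
print("median sessions:", statistics.median(sessions), "outliers:", outliers)
```

Flagged values are candidates for review rather than automatic removal: whether an extreme value is an error or a genuine heavy user is a business question.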
Phase 3. Data Preparation
- Usually takes over 80% of the total time of the data mining process and includes the following steps
- Collection and organization of the data or additional data
- Consolidation and Cleaning
- Data selection
This important phase covers all activities to construct the final dataset from the initial raw data.
Data preparation tasks usually occur multiple times and not in any prescribed order. Tasks include creating data tables, recording key elements of the data, attribute selection as well as transformation and cleaning of data for modeling tools.
- the first part of this phase is very important: the data scientist must decide which data to use for analysis; generally, not all of the available data is used
- criteria for which data to include and which to exclude include the relevance of each data point to the data mining goals, data quality, and technical constraints such as limits on data volume or data types. For example, zip codes and telephone numbers are frequently excluded
- selection of key attributes as well as selection of important records in a table are often part of this process
- the purpose of cleaning the data is to raise the data quality to the level required by the selected analysis techniques; for example, modeling software generally does not work with missing cells, but open-source tools such as R or KNIME offer imputation methods that can fill empty cells efficiently
- this process may involve creating clean subsets of the data, the imputation or insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by modeling.
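Two of the cleaning options mentioned above, creating a clean subset and inserting a suitable default, can be sketched as follows. This is a minimal, hedged example: the column name, the records and the choice of the median as the default are assumptions, and model-based estimation of missing data is not shown.

```python
import statistics

def complete_cases(rows, column):
    """Clean subset: keep only the rows where `column` is present."""
    return [row for row in rows if row[column] is not None]

def impute_median(rows, column):
    """Return copies of rows with None in `column` replaced by the column median."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = statistics.median(observed)
    return [{**row, column: fill if row[column] is None else row[column]} for row in rows]

# Hypothetical user records; None marks a missing age.
rows = [
    {"user": "a", "age": 30},
    {"user": "b", "age": None},
    {"user": "c", "age": 40},
    {"user": "d", "age": 50},
]
subset = complete_cases(rows, "age")
imputed = impute_median(rows, "age")
print(len(subset), "complete rows;", "imputed age for b:", imputed[1]["age"])
```

Dropping rows keeps only observed values but shrinks the dataset, while imputation keeps every record at the cost of introducing estimated values; which trade-off is acceptable depends on the modeling technique and the data mining goal.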
Key Steps in Phase III
- Construct data
this includes data construction operations such as the production of derived attributes, entire new records or transformed values for existing attributes.
- Integrate data
this involves the application of methods in which key information is combined from multiple tables or records to create new records or values that are useful for modeling.
- Format data
formatting data means making syntactic modifications that do not change its meaning but might be required by the modeling tool; e.g., for binary logistic regression, the outcome variable must be coded as 0 or 1.
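The three key steps above can be illustrated together in one short Python sketch. The two tables, their field names and the reference year are hypothetical and chosen only to show one derived attribute (construct), one value combined from a second table (integrate) and one recoded label (format).

```python
# Hypothetical tables: one row per user, and one row per purchase.
users = [
    {"user_id": 1, "signup_year": 2016, "churned": "yes"},
    {"user_id": 2, "signup_year": 2017, "churned": "no"},
]
purchases = [
    {"user_id": 1, "amount": 20.0},
    {"user_id": 1, "amount": 35.0},
    {"user_id": 2, "amount": 15.0},
]

def prepare(users, purchases, current_year=2018):
    """Build one modeling record per user from the two source tables."""
    # Integrate: aggregate the purchases table per user so it can be joined.
    totals = {}
    for purchase in purchases:
        uid = purchase["user_id"]
        totals[uid] = totals.get(uid, 0.0) + purchase["amount"]
    prepared = []
    for user in users:
        prepared.append({
            "user_id": user["user_id"],
            # Construct: a derived attribute computed from an existing one.
            "tenure_years": current_year - user["signup_year"],
            # Integrate: a value pulled in from the second table.
            "total_spend": totals.get(user["user_id"], 0.0),
            # Format: recode a yes/no label as 1/0 without changing its meaning.
            "churned": 1 if user["churned"] == "yes" else 0,
        })
    return prepared

dataset = prepare(users, purchases)
for record in dataset:
    print(record)
```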
Phase 4. Modeling
- The first step is to select the modeling techniques to use on the cleaned and prepared data; usually more than one model is selected, based upon the data mining objective.
- Build models; usually a family of models that seems to best fit the data is selected (e.g., regression models, classification models, unsupervised machine learning models)
- Assess model (rank the models in terms of accuracy)
Various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Some techniques have specific requirements on what type of data can be modeled. Often it is necessary to conduct additional data preparation.
- Run the modeling tool on the prepared dataset to create one or more models
- The data scientist then interprets the models according to their domain knowledge, the data mining success criteria and the desired test design
- Additional assessment is performed as part of the model validation procedures which may include:
- Lift charts
- AUC/ROC measures
- following the validation phase, the data scientist meets with business analysts and domain experts to discuss the data mining results in the business context
- in this step, the data science team generally focuses on the winning models, whereas the evaluation phase also takes into account all other results that were produced in the course of the project
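As a sketch of the "assess model" step, the snippet below ranks two candidate models by AUC, one of the validation measures listed above. The labels and scores are made-up illustration data, the model names are placeholders, and the AUC is computed with a simple pairwise comparison that is adequate for small examples but slow on large datasets.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical validation labels and the scores two candidate models assign.
labels = [1, 0, 1, 1, 0, 0]
candidate_scores = {
    "model_a": [0.9, 0.2, 0.8, 0.6, 0.4, 0.1],
    "model_b": [0.7, 0.6, 0.4, 0.8, 0.3, 0.5],
}
aucs = {name: auc(labels, scores) for name, scores in candidate_scores.items()}
ranking = sorted(aucs, key=aucs.get, reverse=True)
for name in ranking:
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

The pairwise form reflects what AUC measures: the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one.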
Phase 5. Evaluation
- Evaluation of model: how well it performed on test data
- Methods and criteria: depend on model type
- Interpretation of model: whether it is important, and whether it is easy or hard, depends on the algorithm
Thoroughly evaluate the model and review the steps executed to construct the model to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
- assesses the degree to which the model meets the business objectives
- seeks to determine if there is some business reason why this model is deficient
- test the model(s) in the real application if time and budget constraints permit
- also assesses other data mining results generated
- unveil additional challenges, information or hints for future directions
- do a more thorough review of the data mining engagement in order to determine if there is any important factor or task that has somehow been overlooked
- review the quality assurance issues
- e.g., “Did we correctly build the model?”
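One simple way to connect a model's test-set performance back to the success criteria stated in Phase 1 is sketched below. The labels, the predictions and the 0.70 accuracy threshold are hypothetical; in practice the threshold and the choice of metric come from the business understanding, and the appropriate metrics depend on the model type.

```python
def classification_report(actual, predicted):
    """Accuracy, precision and recall for binary labels (1 = positive class)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical held-out test labels and the model's predictions for them.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
report = classification_report(actual, predicted)

# Hypothetical success criterion taken from the Phase 1 statement.
success_threshold = 0.70
meets_criteria = report["accuracy"] >= success_threshold
print(report, "meets success criteria:", meets_criteria)
```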
Determine next steps
- decide how to proceed at this stage
- decide whether to finish the project and move on to deployment, if appropriate, or whether to initiate further iterations or set up new data mining projects
- include an analysis of the remaining resources and budget, which influences the decision
Phase 6. Deployment
- Determine how the results need to be utilized
- Who needs to use them?
- How often do they need to be used?
- Deploy data mining results by scoring a database, utilizing results as business rules, or interactive scoring online
The knowledge gained will need to be organized and presented in a way that the customer can use it. However, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
- in order to deploy the data mining result(s) into the business, take the evaluation results and determine a strategy for deployment
- document the procedure for later deployment
Plan monitoring and maintenance
- important if the data mining results become part of the day-to-day business and its environment
- helps to avoid unnecessarily long periods of incorrect usage of data mining results
- needs a detailed plan for the monitoring process
- takes into account the specific type of deployment
Produce final report
- the project leader and team write up a final report
- may be only a summary of the project and its experiences
- may be a final and comprehensive presentation of the data mining result(s)
- assess what went right and what went wrong, what was done well and what needs to be improved
CRISP_DM Applications at Skylab USA.com
Skylab USA.com recently implemented the CRISP_DM model in order to help organize and process all of the user data it collects and analyzes.
Below is a summary of the user data that is currently being tracked by Skylab USA.com:
Data Points & Stats
- Users Behavioral Profile
- Consistency streaks
- Best streak
- Current streak
- Actions stats
- Times a User has taken an Action defined in the Actions inventory
- Virality stats
- Sign Ups
- Ripple (this is the total “impact” the User has had by sharing the app or sharing content, bringing people into the platform; it includes the “first generation” of sign ups as well as all generations below to infinity, so Ripple will always be >= Sign Ups)
- Consistency streaks
- Tags system
- Users get tagged by smart tags
- User gamification scores
- Users have a score that is used to sort the Leaderboard
- Users earn points that add to their score by engaging with the app (e.g. liking, sharing, taking actions, posting photos, etc.)
- Badges engine tracking both number of actions as well as consistency streaks
- A significant number of platform actions (e.g. liking, commenting, sharing, following, chatting, etc.) are being tracked
- Any custom Action that is set on the Actions inventory
- Consuming content (i.e. completing a Post)
- Any Training content (i.e. completed Course X)
- Admin Dashboard
- Total Users
- New Users
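The Ripple stat described in the list above counts a user's direct sign-ups plus every later generation of sign-ups. A minimal sketch of that calculation follows, assuming the referral data is available as a map from each inviter to the users who signed up through them; that data model and all names are assumptions for illustration only.

```python
from collections import deque

def ripple(referrals, user):
    """Count every sign-up descending from `user` across all generations."""
    seen = set()
    queue = deque(referrals.get(user, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(referrals.get(current, []))
    return len(seen)

# Hypothetical referral map: inviter -> users who signed up through them.
referrals = {
    "ana": ["ben", "cy"],
    "ben": ["dee"],
    "dee": ["eli", "flo"],
}
sign_ups = len(referrals.get("ana", []))
total_ripple = ripple(referrals, "ana")
print(f"ana: sign-ups = {sign_ups}, ripple = {total_ripple}")
```

Because the first generation is included in the traversal, the result is never smaller than the user's direct sign-up count, which matches the Ripple >= Sign Ups property stated above.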
Data that will be tracked in the future. Note: CSW stands for Community Stats Widget, a new functionality that is currently in development and will be added to the app in the near future.
- Total Users [CSW]
- New Users [CSW]
- User Engagement
- MAU / DAU [CSW]
- Average number of sessions per user
- Average time per User
- iOS/Android App downloads to date
- Total number of Countries [CSW]
- Total number of Cities [CSW]
- Use location tags data to be able to report or display locations on a map
- Planet-level: identity tags at the Planet level (e.g. Gender, Generation, etc.)
- Experience-level: TpEs (Tags-per-Experience tags)
- Actions taken
- Total Actions taken [CSW]
- Actions taken by type/orientation:
- Socially responsible actions [CSW] (e.g. 123)
- Personally responsible actions [CSW]
- Actions taken by category/theme (at Planet level and above) [CSW] (e.g. 657 «Health & Wellness» Actions taken today)
- Specific Actions taken (at Experience level) [CSW] (e.g. 234 «Action 1: 20′ Workout» taken today on «Experience A»)
- Total channels published (does not include drafts)
- Channels Followed
- Channel unsubscribes
- Total Posts published (does not include drafts)
- Post stats viewed
- Post read (i.e. completed) [CSW]
- Training Programs/Categories
- Total Training Categories published (does not include drafts)
- Total Courses published (does not include drafts)
- Total Lessons published (does not include drafts)
- Top 10 Courses & Lessons that Users are most engaged with.
- Courses Enrolled
- Courses Completed [CSW]
- Courses Completion Ratio (Courses Completed / Courses Enrolled)
- Stats per Categories/Programs (aggregation of Course stats) —Views, Completions, Completions Ratio, Likes, Comments, Shares
- Stats per Courses (aggregation of Lesson stats) —Views, Completions, Completions Ratio, Likes, Comments, Shares
- Stats per Lessons —Views, Completions, Completions Ratio, Likes, Comments, Shares
- Community activity
- User Photos (e.g. Selfies) posted on the RW [CSW]
- Badges won – time period
- Total Messages Sent
- Total Users interacting via Chat
- Revenue / Income Generated through Web Payments and/or In-app Purchases
- Total Revenue: day, week, month, year, to date (or whatever the filter allows)
- Channels subscriptions revenue
- Training one-time payments revenue
- Refunds from one-time payments (not subscriptions)
NOTE: a stat name with «[CSW]» at the end means that the stat should be available for display on the Community Stats Widget on the Home Screen. These are stats that the End User could see if the Admin chooses to display them on the Community Stats Widget.
Skylab USA.com Data Mining Set Up (Influenced by CRISP_DM)
Stats per Data Point
For each of the data points, the following must be available:
- Current value (e.g. Total Users)
- Delta or % increase/decrease (e.g. Total Users increased 5% in the last month)
- Ability to display current value and delta according to specific time frames: day / week / month / quarter / year
- Data collected and displayed must aggregate from User behavior on all platforms: iOS + Android + Web
- Data “resolution” should be down to the minute, and at the very least to the hour: data must be trackable and reportable at least on an hourly basis, if not minute by minute.
- For the purpose of reporting and data mining, Actions added to the Actions Inventory must be segmented by:
- Personal Responsibility
- Social Responsibility
- Categories (i.e. Topic/Theme the Action relates to)
- Personal Growth
- Health / Wellness
- Education / Training
- Community Development
- Finances / Wealth
- Biz Development
- Nonprofit / Social Activism
- IMPORTANT: Built-in Platform actions (i.e. Liking, Commenting, Sharing, etc.) are to be considered and tracked as «Socially Responsible» actions.
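The "current value plus delta per time frame" requirement above can be sketched for a count-type data point such as New Users. The example assumes events are stored as timestamps already aggregated across iOS, Android and Web; the function name and the sample data are hypothetical, and cumulative stats such as Total Users would need a running total rather than a per-period count.

```python
from datetime import datetime, timedelta

def value_and_delta(timestamps, now, period):
    """Count events in the latest period and the % change vs. the previous one."""
    current = sum(1 for t in timestamps if now - period < t <= now)
    previous = sum(1 for t in timestamps if now - 2 * period < t <= now - period)
    # The delta is undefined when the previous period has no events.
    delta_pct = None if previous == 0 else 100.0 * (current - previous) / previous
    return current, delta_pct

# Hypothetical sign-up timestamps, expressed as days before `now`.
now = datetime(2018, 3, 31, 12, 0)
signups = [now - timedelta(days=d) for d in (1, 2, 3, 5, 6, 8, 9, 10, 12)]

current, delta_pct = value_and_delta(signups, now, timedelta(days=7))
print(f"New Users this week: {current} ({delta_pct:+.0f}% vs. previous week)")
```

Passing a different `period` (for example `timedelta(days=1)` or `timedelta(hours=1)`) gives the same stat at another time frame, provided the timestamps are stored at that resolution or finer.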
Skylab Genesis Process
CRISP_DM is also used to guide the Genesis process, the process of developing a new customer app and platform (i.e., a new planet). The most important application of CRISP_DM here is to understand the overall business model and business purpose. Alongside this understanding of the business purpose and goals, it is important to understand what types of behaviors the customer (i.e., the planet owner) wants users to engage in while on the app. Skylab’s Genesis team uses the first 3 phases of the CRISP_DM model to make sure that they understand what each business owner (i.e., planet owner) wants to accomplish with the planet from a business process perspective, and also which user behaviors will be positively reinforced, modeled and shaped.
References on CRISP-DM
Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13—22.
Gregory Piatetsky-Shapiro (2002); KDnuggets Methodology Poll
Gregory Piatetsky-Shapiro (2007); KDnuggets Methodology Poll
Óscar Marbán, Gonzalo Mariscal and Javier Segovia (2009); A Data Mining & Knowledge Discovery Process Model. In Data Mining and Knowledge Discovery in Real Life Applications, Book edited by: Julio Ponce and Adem Karahoca, ISBN 978-3-902613-53-0, pp. 438–453, February 2009, I-Tech, Vienna, Austria.
Lukasz Kurgan and Petr Musilek (2006); A survey of Knowledge Discovery and Data Mining process models. The Knowledge Engineering Review. Volume 21 Issue 1, March 2006, pp 1–24, Cambridge University Press, New York, NY, USA doi: 10.1017/S0269888906000737.
Azevedo, A. and Santos, M. F. (2008); KDD, SEMMA and CRISP-DM: a parallel overview. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182–185.
Have you seen ASUM-DM?, By Jason Haffar, 16 October 2015, SPSS Predictive Analytics, IBM
Pete Chapman (1999); The CRISP-DM User Guide.
Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer, and Rüdiger Wirth (2000); CRISP-DM 1.0 Step-by-step data mining guides.
Colin Shearer (2006); First CRISP-DM 2.0 Workshop Held
References on social media platform engagement statistics (p.1)
*1- .24B Billion est. for Total Users/2.13B MAU/ MAU%-https://expandedramblings.com/index.php/by-the-numbers-17-amazing-facebook-stats/
*2- Omnicore Agency https://www.omnicoreagency.com/instagram-statistics/
*4-1.5 Billion for Whatsapp https://expandedramblings.com/index.php/whatsapp-statistics/#.uwd2cpldu0u
*5- Snapchat Daily Active Users – https://www.staista.com/chart/7951/snapchat-user-growth/