When approaching a data science project, it is easy to fall back on the traditional pipeline model: establish objectives, capture data, build a model, evaluate and validate it. However, experience leads us to propose a more complex approach, focused on continuous revision, which allows early identification of changes in the scope of the project and in how close it is to reaching its objective.
A reduced view of the project is incomplete and causes problems: the pipeline leaves out, from the very beginning, aspects that may be decisive for the project’s success, and we can end up with “something” that works but is useless.
So, what would actually be a valid process for developing a data project with a predictive focus? As I mentioned at the start, the real cycle is not a sequence of consecutive steps but a feedback loop that allows early detection of whether, for example, the objectives have been defined correctly or the data is adequate.
The two approaches that have been compared are as follows:
DEFINITION OF OBJECTIVES
In the case of customer churn, is the objective of the project to detect customer churn? From our point of view, no: that starting point is already wrong. “To detect” is the objective of the model; the objective of the project is “to reduce” churn as a preventive measure. Churn must of course be detected, but we must also know what will be done once it is detected, otherwise the information is useless. A model can work with very high precision, be validated and complete the whole cycle perfectly… but if its prediction arrives too late to leave time to act, what is the point? From the start we must bear in mind how the model will be used, for what purpose, and how far ahead it must predict to be useful. It is a subtle detail, but it makes a difference to the success of the project. When setting objectives, therefore, sufficient time must be allowed to analyse the question and correct the approach.
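One way to build the "how far ahead" requirement into the model itself is to define the training label with a lead time. The sketch below is only an illustration: the function name `churn_label` and the 30-day and 60-day values are assumptions, not figures from a real project.

```python
from datetime import date, timedelta
from typing import Optional


def churn_label(
    prediction_date: date,
    churn_date: Optional[date],
    lead_days: int = 30,      # assumed time the business needs to react
    window_days: int = 60,    # assumed length of the prediction window
) -> Optional[int]:
    """Label a customer for a churn model that must leave time to act.

    The window opens `lead_days` after the prediction date, so every
    positive label leaves at least that much time for a retention action.
    Customers who churn before the window opens return None: it is already
    too late to act, so they are left out of the training set.
    """
    window_start = prediction_date + timedelta(days=lead_days)
    window_end = window_start + timedelta(days=window_days)
    if churn_date is not None and churn_date < window_start:
        return None
    if churn_date is not None and churn_date < window_end:
        return 1
    return 0
```

With a label defined this way, a "correct" prediction is by construction one that the business still has time to use.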
For example, in a typical ETL process that loads an initial batch into a cloud ecosystem, tools like AWS’s Redshift allow the information to be validated and adapted to a defined format, and we can easily see if it does not fit the format we want. But a subsequent cleaning step is still required, one that this earlier, more automated process does not cover: removing empty values and outliers, and resolving biases that must be clarified with the client (for example, if the majority of clients are the same age when they should be more varied, or the opposite). In other words, a check that the data is clean not only in terms of format, but also in terms of content and logic. This is the process we call data quality, one of the first phases within a PoC, which validates the information we are going to work with.
If the data is not clean, now is the time to stop and resolve this. It must always be remembered that the “output” depends on the “input” data.
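As a rough illustration of the content checks described above (empty values, outliers, and a suspicious concentration of one value such as age), here is a minimal sketch using only the Python standard library. The function name `data_quality_report`, the column name and the 1.5 × IQR outlier rule are assumptions chosen for the example; a real project would agree the rules with the client.

```python
from collections import Counter
from statistics import quantiles


def data_quality_report(rows, column):
    """Summarise missing values, outliers and value concentration for one numeric column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]

    # Flag values further than 1.5 * IQR from the quartiles (a common rule of thumb).
    q1, _, q3 = quantiles(present, n=4)
    spread = 1.5 * (q3 - q1)
    outliers = [v for v in present if v < q1 - spread or v > q3 + spread]

    # Share of the single most frequent value: a high share may signal a bias
    # that should be clarified with the client.
    top_value, top_count = Counter(present).most_common(1)[0]
    return {
        "missing": len(values) - len(present),
        "outliers": outliers,
        "top_value": top_value,
        "top_value_share": top_count / len(present),
    }


clients = [{"age": a} for a in [25, 31, 38, 42, 47, 52, 58, 63, 250, None]]
report = data_quality_report(clients, "age")
```

Here `report` flags one missing age and the value 250 as an outlier; whether that is a data entry error or something else is a question for the source of the data.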
CONSTRUCTION OF THE MODEL
Model construction is preceded by an exploration phase, the exploratory analysis, which serves as a checkpoint or intermediate objective for what we aim to achieve. A classic example is the Titanic dataset, a staple of data science because it shows how feature engineering can extract additional, valuable information from one or several variables. An exploratory analysis gives a great deal of insight and will gradually replace traditional business intelligence, which is limited to answering questions that are already known, whereas in this process it is the information that generates the questions. At this point, constantly checking the findings against the business is key. Does the information we have make sense? Direct consultation with the source is essential to find out; if the data has no meaning for the business, it is necessary to return to the capture point, because quite possibly the data was wrong from the start.
When constructing the model, it should be taken into account that there are two main families of algorithms: supervised and unsupervised. Supervised algorithms are used when we already know the answer for past cases, that is, the data is labelled. With unsupervised algorithms there are no labels or prior answers: I may imagine there are patterns, but I want to see which patterns emerge. There is a huge number of possible algorithms, each with its pros and cons for each individual case: from those that are very simple to run, very interpretable and quick to compute, but cannot capture a very complex pattern, to those that are very complex and expensive to compute, with a great capacity to learn, but at the cost of losing the interpretability of what they are doing.
VALIDATION OF THE PREDICTIVE MODEL
Offline validation is ideally the final phase of a pilot, when you confirm whether the predictive capacity of the model is correct. To do so, the usual approach is to reserve the most recent data in order to simulate what the model would be doing “at that moment”, but without any actual impact.
If offline validation works and satisfies both the data scientist and the business, then it is time to move to the production stage. This step is essential and must therefore be taken into consideration from the design stage. The script and the model may work to perfection, but they cannot be moved to production if they are too complicated in terms of engineering, or need a specific response time that cannot be met … From the very start, the model must be designed for the way it is going to be used in reality. The model may work well with 10 GB, but will it work with 10 TB? A common solution is to expose the model through an API and scale it horizontally. Once in production, it is time to repeat this validation, but in its actual setting. What is the model doing? What is it answering? How long does it take to answer? This is the moment of truth.
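Reserving recent data usually means splitting by date rather than at random, so that the model is only ever evaluated on observations that come after everything it was trained on. A minimal sketch, assuming each record carries a date under a hypothetical `observed_on` key:

```python
from datetime import date


def temporal_split(records, cutoff, date_key="observed_on"):
    """Split records into a training set (before cutoff) and a holdout set (cutoff onwards)."""
    train = [r for r in records if r[date_key] < cutoff]
    holdout = [r for r in records if r[date_key] >= cutoff]
    return train, holdout


# Illustrative records only; the field names are assumptions.
history = [
    {"customer": "a", "observed_on": date(2023, 10, 5), "churned": 0},
    {"customer": "b", "observed_on": date(2023, 11, 20), "churned": 1},
    {"customer": "c", "observed_on": date(2024, 1, 8), "churned": 0},
    {"customer": "d", "observed_on": date(2024, 2, 14), "churned": 1},
]
train, holdout = temporal_split(history, cutoff=date(2024, 1, 1))
```

The model is fitted on `train` and scored on `holdout`, which plays the role of “that moment” without touching any real customer.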
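The question "how long does it take to answer?" can be checked before and after going to production by timing the prediction call and looking at a high percentile rather than the average. The sketch below is an assumption-laden illustration: `latency_percentile` is a hypothetical helper, and `predict` stands for whatever function or API client wraps the model.

```python
import time
from statistics import quantiles


def latency_percentile(predict, payloads, percentile=95):
    """Return the given percentile of the response time, in milliseconds.

    Needs at least two payloads, because `statistics.quantiles` requires
    two or more data points.
    """
    timings_ms = []
    for payload in payloads:
        start = time.perf_counter()
        predict(payload)
        timings_ms.append((time.perf_counter() - start) * 1000)
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    return quantiles(timings_ms, n=100, method="inclusive")[percentile - 1]


# Stand-in for a real model call, used here only to make the sketch runnable.
def dummy_predict(payload):
    return sum(range(1000)) + payload


p95_ms = latency_percentile(dummy_predict, list(range(50)))
```

Comparing `p95_ms` with the response time the business needs shows early on whether the model can be used as designed.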
When everything is working in production, an A/B test is carried out to see whether the approach has the effect we were expecting. In the case of customer churn, as we saw at the start, the A/B test splits customers into two groups: in one, we apply the model together with the possible retention activities; in the other, we apply the prediction but use the results only to check whether the model is working correctly, like a placebo in medical trials. Is the model working correctly? Does the group where I undertook activities perform better than the one where I didn’t?
And this is where we measure and apply the KPIs that should have been established from the start. They must meet basic criteria such as being measurable and attainable, but they must also be in line with the objectives and aims of the business.
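To answer whether the group that received retention activities really did better, the churn rates of the two groups can be compared with a simple two-proportion z-test. This is only a sketch of one possible check, not a prescribed method; the function name `compare_churn` and the figures in the usage example are invented for illustration.

```python
from math import erf, sqrt


def compare_churn(churned_treated, n_treated, churned_control, n_control):
    """Compare churn rates between the treated group and the control group."""
    rate_treated = churned_treated / n_treated
    rate_control = churned_control / n_control

    # Pooled churn rate under the hypothesis that the activities have no effect.
    pooled = (churned_treated + churned_control) / (n_treated + n_control)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_treated + 1 / n_control))
    if std_error == 0:
        raise ValueError("No variation in the data: the test is undefined.")

    z = (rate_control - rate_treated) / std_error
    # One-sided p-value for "treated churn is lower than control churn",
    # using the standard normal distribution.
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return {
        "treated_rate": rate_treated,
        "control_rate": rate_control,
        "z": z,
        "p_value": p_value,
    }


# Invented figures: 80 of 1,000 treated customers churned, against 120 of 1,000 controls.
result = compare_churn(80, 1000, 120, 1000)
```

A small `p_value` suggests the difference between the groups is unlikely to be due to chance alone; whether the reduction is large enough is a matter for the KPIs agreed at the start.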
If the KPIs are met, the data project is a success. If they are not, then we must ask: Was I too demanding when setting the KPIs? Are the detection tools good enough? Can we suggest improvements, such as bringing in more information, creating more variables, building more complex models for a more complete picture, or improving response times?
As you have read, not everything in a data science project is uniform, linear or standardisable across projects. Continuous feedback, closely aligned with agile working methods, is key to reaching objectives that are useful for the business, which are the end goals of a project of this type.