The Trifecta of Data Science: Pilot, Productionise, Manage

Author: Gareth Martin, Data Science & Artificial Intelligence Director at Altius

Note: This post originally appeared on LinkedIn here

For an organisation to be successful with data science, it must be able to carry out all three phases of the trifecta in order to realise long-term value.  The three required capabilities are:

  1. Pilot Development
  2. Productionising
  3. Solution Management

If you can already execute on these three, then you are ahead of the game.  If you can’t, here is some guidance on how to get there.

Pilot Development:
Data scientists working within or for your organisation should have access to a secure platform on which they can develop data science pilots.  When choosing tooling for the piloting phase, the ease of subsequent deployment and management should be taken into account.  It is therefore advisable not only to give your data scientists the required tooling, but also to give them guidelines on what to do, and what not to do, to ease the transition to deployment at a later phase.  For example, if you plan to deploy on a cloud platform, as our team generally does, then the community developing the pilots should work on the same platform and use the same tools that will be integrated for deployment.

Typically, data scientists can develop pilot models in isolation from other engineering roles in the IT team.  When it comes to deployment, this is usually not the case, so educating the data science team on the complexity of deployment is also key.

Productionising:
We use the term productionising to cover the array of activities required to deploy a model securely into an integrated environment where it can be executed automatically within a workflow, in a performant and reliable manner, without manual intervention.  It also covers ensuring that deployment and execution can be monitored and managed as painlessly as possible by the team responsible for the solutions once they are live.

In deployment, it’s important to consider the computing resources required to execute the model and their usage profile, as well as the optimal approach to provisioning and elasticity to control execution costs, particularly in cloud environments.  Information governance and security should be considered, and the organisation’s best practices should be applied to ensure compliance.  Integration with source and target systems, through APIs or other methods, should be part of the core deployment project to ensure that contracts with other systems meet requirements.  Finally, end-to-end testing of the solution, and of the monitoring and management tooling associated with it, should be considered a high priority; a minimal contract-test sketch follows below.  Often these data science models sit at a decision point, and as such should be rigorously tested.
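
As an illustration of the kind of contract and end-to-end checks described above, here is a minimal sketch in Python.  The endpoint URL, request payload, and response fields are hypothetical assumptions for the sake of the example, not part of any specific platform; in practice these would come from the integration contracts agreed with the source and target systems.

```python
# Minimal end-to-end contract check for a deployed scoring endpoint.
# The URL, payload fields, and response fields are hypothetical examples.
import json
import urllib.request

SCORING_URL = "https://example.internal/models/churn/score"  # hypothetical endpoint


def check_scoring_contract() -> None:
    # Hypothetical request body agreed with the calling system.
    payload = json.dumps({"customer_id": "C123", "features": [0.2, 1.7, 3]}).encode("utf-8")
    request = urllib.request.Request(
        SCORING_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        assert response.status == 200, f"Unexpected status: {response.status}"
        body = json.loads(response.read())

    # The contract with downstream systems: a score in [0, 1] and a model version.
    assert "score" in body and 0.0 <= body["score"] <= 1.0, "Score missing or out of range"
    assert "model_version" in body, "Model version missing from response"


if __name__ == "__main__":
    check_scoring_contract()
    print("Scoring contract check passed")
```

A check like this can run as part of the deployment pipeline and again on a schedule once the solution is live, so that a broken contract is caught before downstream systems are affected.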

Live solutions should be secure, performant and financially viable.

Solution Management:
Just like any other solution in a production environment, data science models should be monitored and managed actively.  There is no shortage of horror stories about live models running unmonitored, costing companies dearly as they degrade or drift from their targets without any intervention.

As part of the deployment process, the data scientists should define the acceptable thresholds of model performance with input from the business solution owners.  Monitoring mechanisms should then be created so that, when thresholds are breached, warnings or alerts are triggered or specific event processes are initiated; a minimal sketch of such a check follows below.
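
To make the idea concrete, here is a minimal sketch of a threshold check in Python.  The metric names, threshold values, and alerting hook are illustrative assumptions; in practice the thresholds would come from the data scientists and business solution owners, and the alert would feed whatever monitoring tooling the organisation already uses.

```python
# Minimal sketch of threshold-based model monitoring.
# Metric names, thresholds, and the alert hook are illustrative assumptions.
import logging

logger = logging.getLogger("model_monitoring")

# Acceptable performance thresholds, agreed between the data scientists
# and the business solution owners at deployment time.
THRESHOLDS = {
    "accuracy": {"warn_below": 0.85, "alert_below": 0.80},
    "mean_prediction_drift": {"warn_above": 0.05, "alert_above": 0.10},
}


def check_metrics(latest_metrics: dict) -> None:
    """Compare the latest observed metrics against the agreed thresholds."""
    for metric, limits in THRESHOLDS.items():
        value = latest_metrics.get(metric)
        if value is None:
            logger.warning("No recent value for metric %s", metric)
            continue
        if "alert_below" in limits and value < limits["alert_below"]:
            trigger_alert(metric, value)
        elif "warn_below" in limits and value < limits["warn_below"]:
            logger.warning("%s at %.3f, below warning threshold", metric, value)
        if "alert_above" in limits and value > limits["alert_above"]:
            trigger_alert(metric, value)
        elif "warn_above" in limits and value > limits["warn_above"]:
            logger.warning("%s at %.3f, above warning threshold", metric, value)


def trigger_alert(metric: str, value: float) -> None:
    # Placeholder: hook into the organisation's alerting tool here, or kick off
    # a specific event process such as a retraining workflow.
    logger.error("ALERT: %s at %.3f breached the agreed threshold", metric, value)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    check_metrics({"accuracy": 0.78, "mean_prediction_drift": 0.02})
```

The important design point is that the thresholds are explicit, versioned alongside the model, and owned jointly by the data scientists and the business, rather than living only in someone's head.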

The team managing the solutions should have the skills to investigate errors and to retrain models as they degrade.  To carry out these tasks, skills in languages such as R and Python are typically required, and an understanding of statistical approaches and algorithms is essential.  As such, cross-training your managed services team is necessary, or perhaps augmenting it with new resources who specifically have data science skills.

If you manage to get these three elements across the line, you will be in a great position to enable teams from across your organisation to develop and productionise high-value data science solutions rapidly and securely.