AI Projects Are Hard to Scale

Gartner research shows that only 53% of projects make it from artificial intelligence (AI) prototypes to production. There are two reasons for that. First, amid frequently overhyped expectations, the initial project often lacks a clearly defined path to real value for the organization. The second reason is even more important and often ignored: the technical gap between a shiny prototype and putting the results of that prototype into production is big. Moving from creating a combination of data wrangling and model optimization to deploying that process often requires a complex, sometimes even manual, step. Worse, the technologies used on each side are seldom well aligned. That’s what makes it hard to reliably put the results into production at scale.

To scale successfully, data science platforms need an integrated deployment approach that covers data ingestion and transformation as well as the creation of AI models, and that can automatically move the whole process into a production environment (that is, “deploy” the data processing and the models).

If you are looking to scale your AI projects, the chances are one or two of the statements below will resonate with you:

Multiple technologies: So far, you haven’t been able to find a solution that can mix and match technologies (R, Python, Spark, TensorFlow, on-prem, cloud, hybrid). You should not need to change everything just because you are moving some data from your on-prem database into the cloud.
Consistent tooling: You want to use the exact same set of tools during creation and deployment. No late-night surprises when something that worked great during creation is not available on the production side.
Automatic deployment: Deploying as an application or a scheduled job is important. The deployment step also needs to adapt to change automatically, without manual or intermediate steps.
Rollback: You need the ability to roll back to previous versions of the data science production process to ensure reliability. Of course, problems with deployed versions should have been caught a lot earlier in your existing test and validation setup, right?
Backward compatibility: This is a must. You need to be able to run both creation and production processes years later with guaranteed backward compatibility, so that results are reproducible and your processes are auditable.
Agility: If you need to revise the data science process, the revision needs to be deployed immediately. You should not have to wait for someone else to recode or deploy your process manually. Of course, that automatic deployment ought to have gone through an automated test and validation step as well.
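
To make the rollback and agility requirements a bit more concrete, here is a minimal sketch in plain Python. It is not how KNIME or any other platform implements deployment; the names ProcessRegistry, passes_smoke_test and the three example processes are hypothetical and exist only for illustration. Each "process" bundles data transformation and scoring in one callable, a new version is deployed only if an automated validation step passes, and production can switch back to the previous version.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# A "process" bundles data transformation and scoring in a single callable,
# so creation and production run exactly the same steps.
Process = Callable[[Sequence[float]], float]


@dataclass
class ProcessRegistry:
    """Hypothetical in-memory registry of deployed process versions."""
    validate: Callable[[Process], bool]
    versions: List[Process] = field(default_factory=list)
    active: int = -1  # index of the version currently in production

    def deploy(self, process: Process) -> bool:
        """Deploy automatically, but only if the validation step passes."""
        if not self.validate(process):
            return False
        self.versions.append(process)
        self.active = len(self.versions) - 1
        return True

    def rollback(self) -> None:
        """Switch production back to the previous version, if there is one."""
        if self.active <= 0:
            raise RuntimeError("no earlier version to roll back to")
        self.active -= 1

    def score(self, row: Sequence[float]) -> float:
        """Score one row with the version currently in production."""
        if self.active < 0:
            raise RuntimeError("nothing deployed yet")
        return self.versions[self.active](row)


def passes_smoke_test(process: Process) -> bool:
    """Toy validation step: the process must return a finite number."""
    return math.isfinite(process([1.0, 2.0, 3.0]))


def mean_process(row: Sequence[float]) -> float:
    return sum(row) / len(row)


def max_process(row: Sequence[float]) -> float:
    return max(row)


def broken_process(row: Sequence[float]) -> float:
    return float("nan")
```

In this sketch, deploying broken_process is rejected by the validation step, and calling rollback() after deploying max_process puts mean_process back into production. A real setup would persist versions and run a far more thorough test suite.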

Adding AutoML to the Mix

Once you have your integrated deployment tool in place, adding an automated machine learning solution can help your data science practice as well. Sometimes, fine-tuning the modeling pieces is not a necessity; a solid performer, picked automatically, is good enough.

At KNIME, we have created a couple of AutoML blueprints and also complete boxed-up components, which you can add to your integrated deployment process. Blueprints can serve as a starting point for that, while the components flexibly automate the training, validation and deployment of automatically optimized machine learning algorithms. A third option is our guided applications for AutoML that go through the various steps of the AutoML process, based on user settings, and ultimately select the best available model automatically and export it as a ready-to-use scoring workflow. You choose based on what your organization needs: full or guided automation or an automation component inside your otherwise customized data science workflow.
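
The core idea of "select the best available model automatically" can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the KNIME components or guided applications: the candidates are two toy single-feature regressors, the selection criterion is cross-validated mean squared error, and all function names (train_mean, train_linear, cv_error, select_best) are made up for this example.

```python
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

Model = Callable[[float], float]
Trainer = Callable[[Sequence[float], Sequence[float]], Model]


def train_mean(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Baseline candidate: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def train_linear(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Candidate: ordinary least squares on a single feature."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = cov / var if var else 0.0
    return lambda x: my + slope * (x - mx)


def cv_error(trainer: Trainer, xs: Sequence[float], ys: Sequence[float],
             folds: int = 3) -> float:
    """Mean squared error, averaged over interleaved cross-validation folds."""
    fold_errors = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        test = [i for i in range(len(xs)) if i % folds == k]
        model = trainer([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(mean([(model(xs[i]) - ys[i]) ** 2 for i in test]))
    return mean(fold_errors)


def select_best(candidates: Dict[str, Trainer], xs: Sequence[float],
                ys: Sequence[float]) -> Tuple[str, Model]:
    """Pick the candidate with the lowest CV error and refit it on all data."""
    scores = {name: cv_error(t, xs, ys) for name, t in candidates.items()}
    best = min(scores, key=scores.get)
    return best, candidates[best](xs, ys)


CANDIDATES: Dict[str, Trainer] = {"mean": train_mean, "linear": train_linear}
```

Calling select_best(CANDIDATES, xs, ys) returns the name of the winning candidate together with a model refit on all the data, which is the piece that would then be handed to the deployment step.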

Handling Change

The real world is scary! Changes in the outside world cause your data to change, and that will affect your AI models. This makes it even more vital to have an approach that enables you to continuously monitor your models’ performance, adapt them automatically or trigger data science intervention when needed, and then get the model back into production quickly, easily and automatically.
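
Continuous monitoring can start out very simply. The sketch below, again plain Python with hypothetical names (PerformanceMonitor, needs_retraining), tracks accuracy over a sliding window of recent predictions and raises a flag once it drops below a threshold. The window size and threshold are arbitrary assumptions; what happens when the flag is raised (automatic retraining or a call to the data science team) is left open.

```python
from collections import deque
from typing import Deque


class PerformanceMonitor:
    """Tracks accuracy over the most recent predictions of a deployed model."""

    def __init__(self, window: int = 5, threshold: float = 0.8) -> None:
        # deque(maxlen=...) silently drops the oldest outcome once it is full.
        self.outcomes: Deque[bool] = deque(maxlen=window)
        self.threshold = threshold

    def record(self, predicted: object, actual: object) -> None:
        """Store whether the latest prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        """Share of correct predictions in the window (1.0 if it is empty)."""
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self) -> bool:
        """Flag a problem only once the window is full, to avoid early noise."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold
```

A production job would call record() whenever the true outcome for a scored row becomes known and check needs_retraining() on a schedule.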

Wrapping Up

Data science is growing up. Just as software engineering had to add continuous integration and deployment architectures to become truly professional, data science platforms need to support continuous adjustment of the data processing and modeling, followed by deployment, without any gap: no switching technologies, no loss of options, no manual intervention.

Author: Michael Berthold

Michael Berthold is CEO and co-founder at KNIME, an open source data analytics company. He has more than 25 years of experience in data science, working in academia, most recently as a full professor at Konstanz University (Germany) and previously at University of California, Berkeley and Carnegie Mellon, and in…