Machine Learning (ML) model development often takes time and requires technical expertise. As data science enthusiasts, when we acquire a dataset to explore and analyze, we eagerly train and validate models on it using diverse state-of-the-art architectures or data-centric strategies. Squeezing out the last bit of model performance feels incredibly fulfilling, as if all the work were done.
However, after the model is deployed to production, plenty of factors can cause its performance to drop or degrade over time.
#1 The training data is generated through simulation
Data scientists often face limitations in accessing production data, so the model ends up being trained on simulated or sampled data instead. Even when data engineers take responsibility for making the training data representative in scale and complexity, it still deviates to some extent from the production data. There is also a risk of systematic flaws in upstream data processing, such as data collection and labeling. These factors can prevent the extraction of additional useful input features or hinder the model's ability to generalize well.