An important part of the machine learning workflow is model evaluation. The process itself is common knowledge: split the data into train and test sets, train the model on the train set, and evaluate its performance on the test set using a score function.
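The workflow above can be sketched end-to-end. The following is a minimal, self-contained illustration using NumPy on hypothetical toy data (the data, split ratio, and least-squares model are assumptions for demonstration, not part of any particular library's API):

```python
import numpy as np

# Hypothetical toy data: y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.1, size=100)

# 1. Split the data into train and test sets (80/20, shuffled).
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# 2. Train a model on the train set (here: ordinary least squares
#    with an intercept column, fitted via np.linalg.lstsq).
A_train = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# 3. Evaluate on the test set with a score function (here: RMSE).
A_test = np.hstack([X_test, np.ones((len(X_test), 1))])
y_pred = A_test @ coef
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(f"Test RMSE: {rmse:.3f}")
```

In practice one would typically reach for `sklearn.model_selection.train_test_split` and a fitted estimator, but the three steps are the same.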
A score function (or metric) maps the ground-truth values and their predictions to a single, comparable value [1]. For continuous predictions, for example, common choices include RMSE, MAE, MAPE, and R-squared. But what if the prediction is not a point-wise estimate, but a distribution?
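To make the "mapping to a single value" concrete, here is a sketch of the four point-prediction metrics mentioned above, implemented directly from their textbook definitions (the toy arrays are invented for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error: penalizes large errors quadratically.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    # Mean absolute error: average magnitude of the errors.
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    # Mean absolute percentage error: scale-free, undefined at y_true == 0.
    return np.mean(np.abs((y_true - y_pred) / y_true))

def r_squared(y_true, y_pred):
    # Coefficient of determination: 1 minus residual vs. total variance.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Toy example: each metric collapses the two arrays into one number.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(f"RMSE: {rmse(y_true, y_pred):.3f}")
print(f"MAE:  {mae(y_true, y_pred):.3f}")
print(f"MAPE: {mape(y_true, y_pred):.3f}")
print(f"R2:   {r_squared(y_true, y_pred):.3f}")
```

Each function takes the same two inputs and returns a single scalar, which is what makes different models directly comparable. None of them, however, can consume a full predictive distribution, which motivates the question that follows.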