Our metric combines Cohen's kappa with the F1-scores. Specifically, we computed each team's score on each test image as

score = Cohen's kappa + (macro-averaged F1-score + micro-averaged F1-score) / 2.

The overall score is computed as the average of the per-image scores over all test images. Given the large data size and the continuous nature of the score, ties between teams are extremely unlikely.
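As a concrete illustration, the sketch below computes the per-image score and the overall average with scikit-learn, assuming per-pixel class labels flattened into 1-D arrays. The label coding (0 = benign, 3/4/5 = Gleason grades) and the synthetic data are illustrative assumptions, not the official evaluation code.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score

def image_score(y_true, y_pred):
    """Per-image score: Cohen's kappa + (macro-F1 + micro-F1) / 2."""
    kappa = cohen_kappa_score(y_true, y_pred)
    f1_macro = f1_score(y_true, y_pred, average="macro")
    f1_micro = f1_score(y_true, y_pred, average="micro")
    return kappa + (f1_macro + f1_micro) / 2.0

# Overall score: the mean of the per-image scores over all test images.
rng = np.random.default_rng(0)
test_images = [(rng.choice([0, 3, 4, 5], 1000),   # synthetic ground truth
                rng.choice([0, 3, 4, 5], 1000))   # synthetic prediction
               for _ in range(5)]
overall = np.mean([image_score(t, p) for t, p in test_images])
print(f"overall score: {overall:.4f}")
```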

The confusion matrix is 4-by-4, with each row normalized to sum to 1. The four classes are benign and Gleason grades 3, 4, and 5. For example, the first row of the confusion matrix of team 1 (YujinHu) indicates that 95.94% of benign pixels are correctly classified, 0.07% are misclassified as grade 3, 0.98% as grade 4, and 3.00% as grade 5.
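This row normalization can be reproduced with scikit-learn's confusion_matrix and normalize="true", which divides each row by the total number of pixels of that true class. The label values below are an assumed coding (0 = benign, 3/4/5 = Gleason grades) and the data are synthetic, so this is only a sketch of the computation described above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = [0, 3, 4, 5]  # benign, Gleason grade 3, grade 4, grade 5
rng = np.random.default_rng(1)
y_true = rng.choice(labels, 10_000)  # synthetic per-pixel ground truth
y_pred = rng.choice(labels, 10_000)  # synthetic per-pixel prediction

# normalize="true" makes each row (true class) sum to 1, as described above.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(np.round(cm * 100, 2))  # entries as percentages
```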

Rank  UserID             Score
1     YujinHu            0.845151814
2     nitinsinghal       0.792585244
3     ternaus            0.789663211
4     zhangjingmri       0.778060956
5     sdsy888            0.759776281
6     cvblab             0.757838005
7     XiaHua             0.716059262
8     AlirezaFatemi      0.712536629
9     jpviguerasguillen  0.649811671
10    qq604395564        0.643760156
11    Unipabs            0.587781206
12    ed15b055           0.530973711
13    alxndrkalinin      0.525332123
14    bb_mlb_56          0.503856464
15    ninashp            0.503679041
16    jinchen0227        0.501883245
17    naiyh              0.454468668
[Figure: row-normalized 4-by-4 confusion matrices for the top eight teams (YujinHu, nitinsinghal, ternaus, zhangjingmri, sdsy888, cvblab, XiaHua, AlirezaFatemi).]