Since July, have you noticed how much better your AI model has become? Measuring models is hard to do. All we can do is quantify the vibe : is this one better than that one?
Elo is a score that measures how often one model wins against another, as judged by a human. Which model answers the prompt : “Describe the differences in texture between a Pink Lady and a Macoun apple” better? The one with the higher Elo score.1
In the last four months, the top 100 models have improved their Elo by about 60 points, with the top models now at 1339 vs 1287 in July.
The biggest performance gains occurred in the middle part of the distribution. Researchers have pushed significantly more performance with innovations in algorithms.
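For intuition on what an Elo gap means in head-to-head terms, here is a minimal sketch. It assumes the conventional logistic Elo formula with a scale of 400, which the leaderboard behind these numbers may or may not use. The function name `expected_win_probability` is hypothetical, chosen only for illustration.

```python
def expected_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that model A is preferred over model B.

    Assumption: the standard logistic Elo curve with a 400-point scale.
    The actual leaderboard may use a different scale or fitting procedure.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Top-model ratings quoted in the post: 1339 now vs 1287 in July.
p = expected_win_probability(1339, 1287)
print(f"{p:.3f}")  # roughly 0.574
```

Under that assumption, a 52-point gap means the newer top model would be preferred a bit more than 57% of the time against the July one. Equal ratings give exactly 50%.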
Model Size | Win Probability Increase (%) | Definition |
---|---|---|
Small | 32.0% | < 10b parameters |
Medium | 22.4% | 10b – 100b parameters |
Large | 29.6% | 100b – 200b parameters |
Mega | 25.9% | 200b+ parameters |
The smallest models have increased performance most. October models have increased their win rates by nearly a third in four months. All of the models have improved their competitive win rates by more than 20%.
In July, we posed the question : what happens when model performance asymptotes? Progress in small, medium, & large models is linear in Elo terms.
But the mega models show more data points of inflection, suggesting the recent innovations in reasoning & scale (the largest models have grown from 200b parameters to more than 400b) have produced the beginning of a new high-growth S-curve.
1 See the Bradley-Terry model.