it utilizes the power of attention mechanisms to weigh the relevance of input data
By applying a technique called supervised fine-tuning across modalities, Meta was able to significantly boost CM3leon’s performance at image captioning, visual QA, and text-based editing. Despite being trained on just 3 billion text tokens, CM3leon matches or exceeds the results of other models trained on up to 100 billion tokens.
That’s a very fancy way to say they deliberately focussed it on a small set of information they chose, and that they also heavily configure the implementation. Isn’t it?
This sounds like hard-wiring in bias by accident, and I look forward to seeing comparisons with other models on that…
From the Meta paper:
“The ethical implications of image data sourcing in the domain of text-to-image generation have been a topic of considerable debate. In this study, we use only licensed images from Shutterstock.
As a result, we can avoid concerns related to images ownership and attribution, without sacrificing performance.”Oh no. That… that was the only ethical concern they considered? They didn’t even do a language accuracy comparison? Data ethics got a whole 3 sentences?
For all the self-praise in the paper about state of the art accuracy on low input, and insisting I pronounce “CM3Leon” as Chameleon (no), it would have been interesting to see how well it describes people, not streetlights and pretzels. And how it evaluates text/images generated from outside the cultural context of its pre-defined dataset.
Wow, the face paint one looks like domestic violence…