• 0 Posts
  • 13 Comments
Joined 1 year ago
Cake day: July 1st, 2023



  • It’s not possible to fully remove bias from training datasets. You can maybe try to measure it and attempt to influence it with your own chosen set of biases (a crude sketch of what that measuring could even look like is at the end of this comment), but that’s as good as it gets for the foreseeable future. And even that requires a world of (possibly immediately unprofitable) work to implement.

    Even if your dataset is “the entirety of the internet and written history”, there will always be biases towards the people privileged enough to get online or publish books and talk vast quantities of shit over the past 30 years.

    Having said that, this is also true for every other form of human information transfer in history. “History is written by the victors” is an age-old problem when it comes to truth and reality.

    In some ways I’m glad that LLMs are highlighting this problem.
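    To be clear about what “measure it” could even mean in practice, here is a deliberately crude, hypothetical sketch: counting whose voices a corpus over-represents. The corpus and marker lists below are made up for illustration; real bias measurement is a research field of its own.

    ```python
    # A deliberately crude sketch of "measuring" dataset bias: counting whose
    # language and perspective a corpus over-represents. The corpus, marker
    # terms, and groupings are all hypothetical examples, not a real method.
    from collections import Counter

    corpus = [
        "yet another hot take posted from a smartphone in english",
        "a blog post about startups written in san francisco",
        "a digitised book from a major western publisher",
    ]

    # Crude proxy: which groups even show up in the data at all.
    markers = {
        "english_speaking_online_west": ["english", "san francisco", "western"],
        "everyone_else": ["swahili", "quechua", "oral history"],
    }

    counts = Counter()
    for doc in corpus:
        for group, terms in markers.items():
            counts[group] += sum(term in doc for term in terms)

    print(counts)
    # Counter({'english_speaking_online_west': 3, 'everyone_else': 0}) -- the skew is the point
    ```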




  • Part of this is a symptom of support demands from users. Historically there has been an expectation in software development, dating back to when software was hideously expensive and its users were mostly companies, that errors would be fixed by someone on demand, ASAP. We’re all familiar with the IT guy’s “file a ticket first” signs around the office, or the idiot executive demanding a new computer because they somehow filled theirs with malware.

    But now a lot of what software used to do is web-based and frequently free/freemium, yet the customer’s expectation of having their issue fixed ASAP remains, despite the internet being far from a standardised system of completely intercompatible components. So updates and fixes need to be deployed continually.

    And that’s great for most people, until that expectation extends to the creation of new features, from management and end users alike. Then things start getting pumped out half-finished-at-best because you can just fix the MVP later, right?

    We’re going to get to the backlog sometime… right? We don’t need to keep launching new features every quarter… right?





  • Every time you perform an action like commenting, you expect it to update a few things: the post’s comment count goes up, your comment gets linked into the comment list, the comment itself gets written to the database, and so on. Each update has a cost; let’s say every update costs a dollar. Then each comment would cost $3, $1 per update.

    What if, instead of doing 3 things each time you posted a comment, it did 1300 things, and did the same for everyone else posting a comment? Each comment now costs $1300, and you would run out of cash pretty quickly unless you were a billionaire. Using computing power is like spending cash, and lemmy.world are not billionaires.
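    Playing with the numbers (the $1-per-update figure is purely illustrative, not an actual lemmy.world cost):

    ```python
    # A minimal sketch of the cost analogy above. The dollar figure is made up;
    # the point is that total cost scales with updates-per-comment.
    COST_PER_UPDATE = 1  # dollars, hypothetical

    def cost_of_comments(comments: int, updates_per_comment: int) -> int:
        """Total cost when every comment triggers a fixed number of updates."""
        return comments * updates_per_comment * COST_PER_UPDATE

    # A well-behaved comment touches ~3 things; a pathological one touches 1300.
    print(cost_of_comments(comments=10_000, updates_per_comment=3))     # 30000
    print(cost_of_comments(comments=10_000, updates_per_comment=1300))  # 13000000
    ```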




  • it utilizes the power of attention mechanisms to weigh the relevance of input data

    By applying a technique called supervised fine-tuning across modalities, Meta was able to significantly boost CM3leon’s performance at image captioning, visual QA, and text-based editing. Despite being trained on just 3 billion text tokens, CM3leon matches or exceeds the results of other models trained on up to 100 billion tokens.

    That’s a very fancy way to say they deliberately focussed it on a small set of information they chose, and heavily configured the implementation. Isn’t it? (For what the “attention” bit actually means, there’s a generic sketch at the end of this comment.)

    This sounds like hard-wiring in bias by accident, and I look forward to seeing comparisons with other models on that…

    From the Meta paper:

    “The ethical implications of image data sourcing in the domain of text-to-image generation have been a topic of considerable debate. In this study, we use only licensed images from Shutterstock. As a result, we can avoid concerns related to images ownership and attribution, without sacrificing performance.”

    Oh no. That… that was the only ethical concern they considered? They didn’t even do a language accuracy comparison? Data ethics got a whole 3 sentences?

    For all the self-praise in the paper about state-of-the-art accuracy from low input, and insisting I pronounce “CM3Leon” as Chameleon (no), it would have been interesting to see how well it describes people, not just streetlights and pretzels. And how it handles text/images generated from outside the cultural context of its pre-defined dataset.
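    For reference, this is roughly what the “attention mechanisms to weigh the relevance of input data” line quoted above boils down to: generic, textbook scaled dot-product attention. It is not anything from Meta’s actual CM3Leon code, and the toy data is made up.

    ```python
    # Generic scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    # Each output row is a relevance-weighted mixture of the value rows.
    # Toy illustration only -- not Meta's implementation.
    import numpy as np

    def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # relevance of each key to each query
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V                              # weighted sum of values

    # Three toy "tokens" with 4-dimensional representations.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 4))
    print(attention(x, x, x).shape)  # (3, 4) -- same shape, but relevance-mixed
    ```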