Do creators agree with CC-BY being satisfied if their contribution is attributed at the model level and not at the output level? #41
Replies: 8 comments · 5 replies
-
I have the same reservation. The dataset is not the work - it merely contains the works, which need to be individually attributed, else it is a license violation.
-
When I was a kid, I went to the library and read thousands of books. When I later wrote my first research paper, I credited particular books for particular things I wanted to footnote, and did not give credit to the thousands of books of background reading. For example, I would mention the year that Poland was invaded at the start of World War II without footnoting it. If I recalled something but thought it needed a footnote, I'd go find a book with that information to cite, even if it wasn't the same book I originally read.

In most LLMs, the pre-training data is "a bunch of Common Crawl", and if you want to footnote a particular thing, you can quiz the LLM for a concrete source. Seems like CC-BY would work the same in both situations.

BTW, the dataset world is working towards full lists of sources, via nesting Croissant metadata files.
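To make the Croissant point concrete, here is a rough sketch of what a dataset record listing its per-work sources could look like, built as JSON-LD from Python. The structure follows the general shape of Croissant/schema.org metadata, but the specific field choices are illustrative assumptions rather than the authoritative vocabulary:

```python
# Illustrative sketch of a Croissant-style (JSON-LD) dataset record that
# keeps per-work source attribution. Field choices are assumptions, not
# the authoritative Croissant spec.
import json

dataset_card = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",  # assumed namespace
    },
    "@type": "Dataset",
    "name": "example-crawl-subset",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # One entry per source work, so attribution survives aggregation.
    "isBasedOn": [
        {
            "@type": "CreativeWork",
            "name": "Example Article",
            "author": "A. Creator",
            "url": "https://example.org/article",  # hypothetical source
            "license": "https://creativecommons.org/licenses/by/4.0/",
        },
    ],
}

print(json.dumps(dataset_card, indent=2))
```

Nesting would then mean each sub-dataset ships a record like this, so a top-level dataset can enumerate its sources transitively.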
-
I don't believe you can "quiz the LLM for a concrete source", actually. It is more likely to misattribute or invent a source to please the user than to actually preserve the lineage of sources through the transformation into model weights.
-
I do it frequently. By examining the source, you can discover if it was a misattribution or invention. That's the same way you should evaluate any claimed source, even those supplied by humans.
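On the mechanics of "examining the source": the check can be as simple as fetching the cited page and searching for the claimed passage. A minimal sketch using only the standard library; the URL and quote in the usage comment are hypothetical:

```python
# Minimal sketch: does a cited page actually contain the claimed quote?
import urllib.request

def source_contains(url: str, quote: str) -> bool:
    """Fetch `url` and report whether `quote` appears in its text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        page = resp.read().decode("utf-8", errors="replace")
    return quote.lower() in page.lower()

# Hypothetical usage:
# source_contains("https://example.org/ww2", "invaded Poland on 1 September 1939")
```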
-
We most certainly are not. See also: https://evolvis.org/~tg/cc.htm
-
We’re curious what people think! It is not generally possible to trace an output back to specific inputs in a foundation model’s training data; at the same time, attribution may be more feasible for certain other uses of data (e.g. Retrieval-Augmented Generation). Even in that instance, should such attribution be required? Can it be sufficient to simply provide attribution at the model level? Or should each output of the model also incorporate a form of attribution?
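For the RAG case, here is a minimal sketch of what output-level attribution could look like, assuming a hypothetical pipeline: the key point is that each retrieved chunk carries source metadata that is threaded through to the final answer rather than discarded.

```python
# Sketch of output-level attribution in a hypothetical RAG pipeline.
# The retriever and model are stubbed out; only the attribution
# plumbing is shown.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source_url: str
    author: str
    license: str

def answer_with_attribution(question: str, retrieved: list[Chunk]) -> str:
    # In a real system, `retrieved` would come from a vector store and
    # the answer from an LLM conditioned on the chunks.
    answer = f"[model answer to {question!r}, conditioned on {len(retrieved)} chunks]"
    citations = "\n".join(
        f"- {c.author}, {c.source_url} ({c.license})" for c in retrieved
    )
    return f"{answer}\n\nSources:\n{citations}"

print(answer_with_attribution(
    "When did World War II start in Poland?",
    [Chunk("Germany invaded Poland on 1 September 1939.",
           "https://example.org/ww2", "A. Historian", "CC-BY-4.0")],
))
```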
-
On Wed, 2 Jul 2025, Creative Commons wrote:

> It is not generally possible to trace an output back to specific inputs in a foundation model’s training data

It sufficiently is; people have proven that over and over again. The “AI” operators merely put restrictions on the API to the model so people don’t exploit it, adding them as they become aware of the ways to do so, but raw access to the model generally means that, yes, you can sufficiently extract “training data”.
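Whichever side one takes, near-verbatim regurgitation is at least testable when a candidate source text is in hand. A rough sketch using n-gram overlap; the n value and the pass/fail interpretation are illustrative, not a rigorous membership-inference test:

```python
# Rough sketch: measure how much of a model output reproduces a given
# source text near-verbatim, via overlapping word n-grams.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(output: str, source: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that also appear in the source."""
    out = ngrams(output, n)
    return len(out & ngrams(source, n)) / len(out) if out else 0.0

# A ratio near 1.0 suggests regurgitation of that specific work, which
# is the case where CC-BY attribution plainly comes due.
print(overlap_ratio("the quick brown fox jumps " * 4,
                    "the quick brown fox jumps " * 4))  # -> 1.0
```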
-
I apply a licence with the intention that it's a document describing the interaction between me and the person and/or institution that handles the licenced product (a better term escapes me). An "AI" is a software package, an inanimate object lacking any personhood.

The reuse of the work happens during the training process, and is something that the creator of the model does. That's one place where attribution must happen. The other place is when such a model is used and it regurgitates something for its user. Then, in effect, that user is reusing the work, and they must correctly attribute the source. The LLM is but a means of retrieval, (unfaithful or malicious) paraphrasing, and synthesis. It cannot do attribution, as "reusing" or "creating" are not things it can do - unless we're to believe a text editor is reusing my past work to creatively produce a perfect facsimile of it at the moment of use. Anthropomorphising software and computers will cause nothing but confusion and misunderstandings, at best.

Anyway, in that light, if the model output excludes attribution, then it's basically either causing its user to engage in plagiarism, if they pass the output off as their creation, or else causing the creator/operator of the model and its user interface to engage in plagiarism, as those entities are passing the work off as theirs.

Attribution is not a formality; it's a means of verification, of tracing the sources of information, and of telling what the original contribution of a work is. Plagiarism does not merely affect plagiarised authors/creators: it is a threat to society, as it allows pseudo-experts to build an air of authority where they merit none on the basis of their knowledge and skills.
-
According to the current state of "Using CC-Licensed Works for AI Training", putting a link inside the dataset is sufficient for attribution.
I find this highly surprising, and I will not use CC-BY for works I want attribution for if that is the recommended interpretation. I am against both biological and silicon chauvinism, and I find this idea leads to a double standard. The biological equivalent would be: if some flesh-brained person publishes their browsing history somewhere on their website, they are allowed to republish any CC-BY work without a link to the original or any attribution.
In my understanding, if an AI in any form creates derivatives or reproductions of a CC-BY licensed work, it MUST reproduce the correct CC-BY attribution, just as it is for humans. If current AI model approaches cannot guarantee that property mathematically, they MUST NOT train on CC-BY information.
What do others think about this?
Are you a user of CC-BY (feel free to link some creations), and do you feel happy with a link to a dataset as attribution when an AI is trained on your CC-BY creations?