Provide vocabulary to specify purposes and permissions related to AI training #82

scottkellum · 2022-12-08T18:22:13Z

I’m not sure this is the correct place to file this issue, but I would love some standardized way to disallow my content (writing, photography, code) from being used in AI training data.

I’m imagining an extension of robots.txt where I can explicitly disallow crawlers that search for AI training data.

Alternatively, some standard way to indicate copyright permissions that also covers training usage might be helpful, but that is probably more complicated.

Ultimately I want people to find my work, but I don’t want it to end up in an AI model for others to make things in the style of my work.
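To make the robots.txt idea concrete, here is a minimal sketch of what a per-crawler opt-out can look like with plain robots.txt rules, checked with Python's standard `urllib.robotparser`. The user-agent names `ExampleAITrainingBot` and `ExampleSearchBot` and the page URL are hypothetical placeholders. This only helps against crawlers that identify themselves and honour robots.txt; it is not a standardized AI-training signal.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) AI-training crawler
# while leaving the site open to every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAITrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "https://example.com/portfolio/photo-1.html"  # placeholder URL

# The AI-training crawler is refused, a generic crawler is still allowed.
ai_allowed = parser.can_fetch("ExampleAITrainingBot", page)
search_allowed = parser.can_fetch("ExampleSearchBot", page)
print(ai_allowed, search_allowed)
```

The limitation is visible in the sketch: the rule is keyed on a crawler name rather than on the purpose (training), which is why a vocabulary for purposes and permissions would be more robust.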

coolharsh55 · 2022-12-26T07:42:20Z

Hi. Thanks for the proposal. This is an interesting application that I/we hadn't foreseen. Of existing vocabs, I think ODRL would be a good option to specify machine-readable licenses to indicate what is permitted and prohibited, and schema.org might suffice to specify types of content (e.g. images, videos). What is then left is specifying purposes such as training an ML model, for which I don't think there are existing vocabularies.

In DPVCG, we are interested in expanding the DPV to more regulations, such as the EU's AI Act, where such purposes are relevant. So I think this can function as a use case towards the development of AI-relevant vocabularies, including purposes. For example, https://w3id.org/AIRO#training is a concept from @DelaramGlp's work on AI-related risk management that refers to the training phase in the AI development lifecycle. In DPV, this can be a category of purpose. You and others are welcome to help with these efforts, provide such purposes, or contribute more use cases/examples.

coolharsh55 · 2025-02-13T09:26:18Z

We discussed this in Meeting FEB-06 and decided to include it in the scope of DPV, so that DPV provides concepts to represent AI training that can be used with vocabularies like ODRL to express policies and agreements stating permissions/prohibitions over AI training. This is a rather complex topic, and we don't want to simply add AITraining as a concept because there are nuances to consider, e.g. what data is involved, where training is run, whether it's training a new model, fine-tuning, or even RAG, or whether it is for an open source model. I have noted my thoughts on this here: https://harshp.com/dev/dpv/ai-training and we will continue to discuss this and solicit proposals for what should be included in DPV.

Eventually, we may decide to include composite concepts such as AITrainingPermittedLocally that are expressed as a combination of different DPV concepts - similar to P7012 Privacy Terms, but having the basic concepts is a precursor to that. Such concepts would be helpful to directly indicate the intended behaviour, which could then be explicitly/formally denoted using approaches such as ODRL or others.

coolharsh55 · 2025-02-15T00:01:36Z

(using dpvbot:) This was discussed in Meeting 2025-02-13
Question on where these concepts will be defined. Mentioned ISO standards as a potential source.

coolharsh55 · 2025-02-25T11:21:57Z

I collected some more thoughts on this in a blog post. The summary of it is:

- We should consider Training and Processing as overlapping concepts.
- The AI extension already contains some training concepts, so we should create ai:Training as a subclass of ai:Technique, and then expand it further into ai:DataTraining and ai:NonDataTraining based on whether data is always involved or not, with ai:DataTraining being a subclass of Processing so that use of (personal) data is flagged consistently even when using AI terminology.
- To enable use of the existing concepts in the AI extension such as ai:SupervisedLearning, we create the property ai:hasTrainingTechnique as a subproperty of ai:hasTechnique to specifically refer to training techniques. If the range of this is a concept from ai:DataTraining, then it means the training is done using data.
- We use the existing ai:hasTrainingData to express what data is used for training. This can be personal or non-personal data, as indicated by the concept. The property itself makes no indication of this.
- Instead of modelling training, etc. as separate broad concepts, we should use dpv:Process and indicate inside it what data is being used for training, where it takes place, what is permitted/prohibited, etc. We can create a concept called ai:AIProcess to flag that there is AI involved (somewhere) in the process.
- We should reuse existing concepts such as dpv:ProcessingCondition to indicate data is (only) processed on device, with dpv:Location only referring to where the training takes place.
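
The subclass reasoning for ai:DataTraining can be illustrated with a small sketch: if ai:DataTraining sits under both ai:Training and dpv:Processing, anything that looks for dpv:Processing will also pick up data-based training. The hierarchy below only mirrors the proposal in this comment and is not the published DPV model; `ex:FineTuningOnUserData` is a hypothetical concept added purely for illustration.

```python
# Proposed (not published) hierarchy: concept -> direct parent concepts.
PARENTS = {
    "ai:Training": ["ai:Technique"],
    "ai:DataTraining": ["ai:Training", "dpv:Processing"],
    "ai:NonDataTraining": ["ai:Training"],
    # Hypothetical concept, only here to show inheritance at work.
    "ex:FineTuningOnUserData": ["ai:DataTraining"],
}


def ancestors(concept: str) -> set[str]:
    """Return every transitive parent of a concept."""
    found: set[str] = set()
    stack = list(PARENTS.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return found


def is_processing(concept: str) -> bool:
    """True if the concept is dpv:Processing or a (transitive) subclass of it."""
    return concept == "dpv:Processing" or "dpv:Processing" in ancestors(concept)


print(is_processing("ex:FineTuningOnUserData"))  # True: inherits via ai:DataTraining
print(is_processing("ai:NonDataTraining"))       # False: no data involved
```

With this split, ai:NonDataTraining stays an AI technique without being flagged as data processing, which is the distinction the proposal is after.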

Example: A notice stating that Name (personal data) will be used for training to provide personalised recommendations based on informed consent, that the training will take place on device, that data will (only) be stored on device, and (optionally) a prohibition stating that data will not be transferred outside the device.

ex:SomeNotice a dpv:AINotice ;
    dpv:hasProcess [
        a ai:AIProcess ;
        ai:hasTrainingData pd:Name ;
        dpv:hasPurpose dpv:ProvidePersonalisedRecommendations ;
        dpv:hasLocation dpv:WithinDevice ;
        dpv:hasProcessingCondition [
            dpv:hasProcessing dpv:Store ;
            dpv:hasLocation dpv:WithinDevice ;
        ] ;
        dpv:hasProhibition [
            a dpv:Prohibition ;
            dpv:hasProcessing dpv:Transfer ;
            dpv:hasLocation dpv:OutsideDevice ; # new concept
        ] ;
        dpv:hasLegalBasis dpv:InformedConsent ;
    ] .
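
As a rough illustration of how a consumer of such a notice might act on the prohibition, the sketch below mirrors the example in a plain Python dict and checks a planned operation against it. The dict layout and the `is_prohibited` helper are assumptions made for this sketch, not part of DPV, and the check uses exact matching only (no subclass reasoning).

```python
# Plain-Python mirror of the example notice; the layout is an assumption.
NOTICE = {
    "training_data": "pd:Name",
    "purpose": "dpv:ProvidePersonalisedRecommendations",
    "location": "dpv:WithinDevice",
    "legal_basis": "dpv:InformedConsent",
    "prohibitions": [
        # dpv:OutsideDevice is the new concept proposed in the example.
        {"processing": "dpv:Transfer", "location": "dpv:OutsideDevice"},
    ],
}


def is_prohibited(notice: dict, processing: str, location: str) -> bool:
    """Exact-match check of an operation against the notice's prohibitions."""
    return any(
        rule["processing"] == processing and rule["location"] == location
        for rule in notice["prohibitions"]
    )


# Transferring data off the device is prohibited; storing it on device is not.
transfer_blocked = is_prohibited(NOTICE, "dpv:Transfer", "dpv:OutsideDevice")
store_blocked = is_prohibited(NOTICE, "dpv:Store", "dpv:WithinDevice")
print(transfer_blocked, store_blocked)
```

A formal policy language such as ODRL would replace this ad hoc check; the sketch only shows the intended behaviour.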

coolharsh55 changed the title ~~Standardized way to disallow content for AI training~~ Provide vocabulary to specify purposes and permissions related to AI training Dec 26, 2022

coolharsh55 added scope labels Dec 26, 2022

coolharsh55 added this to the dpv v2.1 milestone Apr 13, 2024

coolharsh55 added help-wanted proposal AI and removed concepts labels Jul 10, 2024

coolharsh55 added this to dpv 2.1 planning Jul 16, 2024

coolharsh55 moved this to Backlog in dpv 2.1 planning Jul 16, 2024

coolharsh55 modified the milestones: dpv v2.1, dpv 2.2 Nov 12, 2024

coolharsh55 added WIP and removed proposal labels Feb 13, 2025
