Provide vocabulary to specify purposes and permissions related to AI training #82

Open
scottkellum opened this issue Dec 8, 2022 · 4 comments

Comments

@scottkellum

I’m not sure this is the correct place to file this issue, but I would love some standardized way to disallow my content (writing, photography, code) from being used as AI training data.

I’m imagining an extension of robots.txt where I can explicitly disallow crawlers that search for AI training data.
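
For illustration, robots.txt can already address a crawler by its user-agent name, which is presumably the mechanism such an extension would build on. The sketch below uses Python's standard `urllib.robotparser` to evaluate such a file. The crawler names `ExampleAITrainingBot` and `ExampleSearchBot` and the URL are made up for the example, and nothing in robots.txt forces a crawler to comply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) AI-training crawler
# and allow every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAITrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
url = "https://example.com/photos/sunset.jpg"  # placeholder URL

ai_allowed = parser.can_fetch("ExampleAITrainingBot", url)
search_allowed = parser.can_fetch("ExampleSearchBot", url)
print("AI training crawler allowed:", ai_allowed)
print("Search crawler allowed:", search_allowed)
```

This only distinguishes crawlers by name, so it cannot say anything about what a crawler does with the content afterwards.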

Alternatively, some sort of standard way to indicate copyright permissions, including whether training usage is allowed, might also be helpful, but is probably more complicated.

Ultimately I want people to find my work, but I don’t want it to end up in an AI model for others to make things in the style of my work.

@coolharsh55
Collaborator

Hi. Thanks for the proposal. This is an interesting application that I/we hadn't foreseen. Of the existing vocabularies, I think ODRL would be a good option for specifying machine-readable licenses that indicate what is permitted and prohibited, and schema.org might suffice to specify types of content (e.g. images, videos). What is then left is specifying purposes such as training an ML model - for which I don't think there are existing vocabularies.

In DPVCG, we are interested in expanding the DPV to more regulations - such as the EU's AI Act, where such purposes are relevant. So I think this can function as a use case towards the development of AI-relevant vocabularies, including purposes. For example, https://w3id.org/AIRO#training is a concept from @DelaramGlp's work on AI-related risk management that refers to the training phase in the AI development lifecycle. In DPV, this can be a category of purpose. You and others are welcome to help with these efforts, provide such purposes, or contribute more use cases/examples.
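
To make the combination described above more concrete, here is a rough sketch of an ODRL-style policy built as JSON-LD in Python. It assumes a prohibition on the ODRL `use` action, constrained by a `purpose` that points at the AIRO training concept as a stand-in, with the asset typed using schema.org. The policy and asset IRIs are placeholders, `is_prohibited` is a deliberately naive helper and not an ODRL evaluator, and whether AIRO#training is the right value for a purpose is exactly the open question.

```python
import json

AI_TRAINING = "https://w3id.org/AIRO#training"  # concept mentioned above
PHOTO = "https://example.com/photos/sunset.jpg"  # placeholder asset IRI

# Assumed shape: an ODRL Set with one prohibition whose constraint
# matches on purpose. Not a vetted or validated policy.
policy = {
    "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        {"schema": "https://schema.org/"},
    ],
    "@type": "Set",
    "uid": "https://example.com/policy/no-ai-training",  # placeholder
    "prohibition": [{
        "target": {"@id": PHOTO, "@type": "schema:Photograph"},
        "action": "use",
        "constraint": [{
            "leftOperand": "purpose",
            "operator": "eq",
            "rightOperand": {"@id": AI_TRAINING},
        }],
    }],
}


def is_prohibited(policy: dict, target: str, purpose: str) -> bool:
    """Naive check: does any prohibition name this target and purpose?"""
    for rule in policy.get("prohibition", []):
        if rule["target"]["@id"] != target:
            continue
        for c in rule.get("constraint", []):
            if (c["leftOperand"] == "purpose"
                    and c["operator"] == "eq"
                    and c["rightOperand"]["@id"] == purpose):
                return True
    return False


print(json.dumps(policy, indent=2))
print(is_prohibited(policy, PHOTO, AI_TRAINING))
```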

@coolharsh55 coolharsh55 changed the title from "Standardized way to disallow content for AI training" to "Provide vocabulary to specify purposes and permissions related to AI training" Dec 26, 2022
@coolharsh55 coolharsh55 added this to the dpv v2.1 milestone Apr 13, 2024
@coolharsh55 coolharsh55 moved this to Backlog in dpv 2.1 planning Jul 16, 2024
@coolharsh55 coolharsh55 modified the milestones: dpv v2.1, dpv 2.2 Nov 12, 2024
@coolharsh55
Collaborator

We discussed this in Meeting FEB-06 and decided to include it in the scope of DPV, which will provide concepts to represent AI training so that they can be used with vocabularies like ODRL to express policies and agreements that state permissions/prohibitions over AI training. This is a rather complex topic, and we don't want to simply state AITraining as a concept, as there are nuances to consider, e.g. what data is involved, where training is run, whether it's training a new model or fine-tuning or even RAG, or whether it is for an open source model. I have noted my thoughts on this here: https://harshp.com/dev/dpv/ai-training and we will continue to discuss this and solicit proposals for what should be included in DPV.

Eventually, we may decide to include composite concepts such as AITrainingPermittedLocally that are expressed as a combination of different DPV concepts - similar to P7012 Privacy Terms, but having the basic concepts is a precursor to that. Such concepts would be helpful to directly indicate the intended behaviour, which could then be explicitly/formally denoted using approaches such as ODRL or others.
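
As a rough illustration of what "a combination of different DPV concepts" could mean in practice, the sketch below expands a composite label into a few basic facets. Only the name `AITrainingPermittedLocally` comes from the discussion above. The facet names, their values, the reading of "locally" as "within the device", and the second label `AITrainingProhibited` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingTerms:
    """Basic facets a composite label might expand to (all illustrative)."""
    permitted: bool  # is training allowed at all?
    location: str    # e.g. "WithinDevice" or "Anywhere"
    kind: str        # e.g. "NewModel", "FineTuning", "RAG" or "Any"


# Hypothetical expansions; nothing here is published DPV.
COMPOSITES = {
    "AITrainingPermittedLocally": TrainingTerms(True, "WithinDevice", "Any"),
    "AITrainingProhibited": TrainingTerms(False, "Anywhere", "Any"),
}


def expand(label: str) -> TrainingTerms:
    """Look up the basic facets behind a composite label."""
    return COMPOSITES[label]


terms = expand("AITrainingPermittedLocally")
print(terms)
```

The point of the split is that the short label stays readable in a notice, while the expanded facets are what a formal policy language would actually evaluate.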

@coolharsh55 coolharsh55 added WIP and removed proposal labels Feb 13, 2025
@coolharsh55
Collaborator

(using dpvbot:) This was discussed in Meeting 2025-02-13
Question on where these concepts will be defined. Mentioned ISO standards as a potential source.

@coolharsh55
Collaborator

I collected some more thoughts on this in a blog post. The summary of it is:

  1. We should consider Training and Processing as overlapping concepts
  2. The AI extension already contains some training concepts, so we should create ai:Training as a subclass of ai:Technique, and then expand it further into ai:DataTraining and ai:NonDataTraining based on whether data is always involved or not, with ai:DataTraining being a subclass of Processing so that use of (personal) data is flagged consistently even when using AI terminology
  3. To enable use of the existing concepts in the AI extension such as ai:SupervisedLearning, we create the property ai:hasTrainingTechnique as a subproperty of ai:hasTechnique to specifically refer to training techniques. If the range of this is a concept from ai:DataTraining, then it means the training is done using data.
  4. We use the existing ai:hasTrainingData to express what data is used for training. This can be personal or non-personal data - as indicated by the concept. The property itself makes no indication of this.
  5. Instead of modelling training, etc. as separate broad concepts, we should use dpv:Process and indicate inside it what data is being used for training, where it takes place, what is permitted/prohibited, etc. We can create a concept called ai:AIProcess to flag that there is AI involved (somewhere) in the process.
  6. We should reuse existing concepts such as dpv:ProcessingCondition to indicate data is (only) processed on device, with dpv:Location only referring to where the training takes place.
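
The hierarchy proposed in points 2 and 3 can be sketched as a small subclass table with a transitive lookup. This is only a toy model of the proposal, not published DPV. Placing ai:SupervisedLearning under ai:DataTraining is an assumption made for the example, and `is_subclass` and `uses_data` are hypothetical helper names.

```python
# Proposed subclass links from points 2 and 3 (proposal only).
SUBCLASS_OF = {
    "ai:Training": {"ai:Technique"},
    "ai:DataTraining": {"ai:Training", "dpv:Processing"},
    "ai:NonDataTraining": {"ai:Training"},
    # Assumption for illustration: supervised learning always involves data.
    "ai:SupervisedLearning": {"ai:DataTraining"},
}


def is_subclass(child: str, ancestor: str) -> bool:
    """Transitive (and reflexive) subclass check over SUBCLASS_OF."""
    stack, seen = [child], set()
    while stack:
        node = stack.pop()
        if node == ancestor:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(SUBCLASS_OF.get(node, ()))
    return False


def uses_data(training_technique: str) -> bool:
    """Point 3: a technique under ai:DataTraining implies training with data."""
    return is_subclass(training_technique, "ai:DataTraining")


print(is_subclass("ai:SupervisedLearning", "dpv:Processing"))
print(uses_data("ai:NonDataTraining"))
```

Because ai:DataTraining sits under dpv:Processing, anything typed with a data-based training technique is also picked up by tooling that only looks for processing of (personal) data.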

Example: A notice stating that Name (personal data) will be used for training to provide personalised recommendations based on informed consent, that the training will take place on device, that data will (only) be stored on device, and (optionally) a prohibition stating that data will not be transferred outside the device.

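@prefix dpv: <https://w3id.org/dpv#> .
@prefix pd: <https://w3id.org/dpv/pd#> .
@prefix ai: <https://w3id.org/dpv/ai#> .
@prefix ex: <https://example.com/> . # placeholder namespace

ex:SomeNotice a dpv:AINotice ;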
    dpv:hasProcess [
        a ai:AIProcess ;
        ai:hasTrainingData pd:Name ;
        dpv:hasPurpose dpv:ProvidePersonalisedRecommendations ;
        dpv:hasLocation dpv:WithinDevice ;
        dpv:hasProcessingCondition [
            dpv:hasProcessing dpv:Store ;
            dpv:hasLocation dpv:WithinDevice ;
        ] ;
        dpv:hasProhibition [
            a dpv:Prohibition ;
            dpv:hasProcessing dpv:Transfer ;
            dpv:hasLocation dpv:OutsideDevice ; # new concept
        ] ;
        dpv:hasLegalBasis dpv:InformedConsent ;
    ] .
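
To show how a consumer might act on such a notice, here is a minimal sketch that mirrors the process above as a plain Python dict and checks an intended operation against its prohibitions. The dict keys and the `is_prohibited` helper are hypothetical. A real implementation would parse the RDF and would need to handle concept hierarchies rather than exact string matches.

```python
# Plain-dict mirror of the example process (keys are illustrative).
process = {
    "type": "ai:AIProcess",
    "training_data": ["pd:Name"],
    "purpose": "dpv:ProvidePersonalisedRecommendations",
    "location": "dpv:WithinDevice",
    "conditions": [
        {"processing": "dpv:Store", "location": "dpv:WithinDevice"},
    ],
    "prohibitions": [
        {"processing": "dpv:Transfer", "location": "dpv:OutsideDevice"},
    ],
    "legal_basis": "dpv:InformedConsent",
}


def is_prohibited(process: dict, processing: str, location: str) -> bool:
    """Exact-match check of an operation against the listed prohibitions."""
    return any(
        p["processing"] == processing and p["location"] == location
        for p in process.get("prohibitions", [])
    )


print(is_prohibited(process, "dpv:Transfer", "dpv:OutsideDevice"))
print(is_prohibited(process, "dpv:Store", "dpv:WithinDevice"))
```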


2 participants