
Running a prediction on a url field runs it on the URL string itself, not the actual content #6612

Closed
DrissiReda opened this issue Nov 6, 2024 · 4 comments


@DrissiReda

Describe the bug
I am running predictions using a model served by label-studio-ml-backend. It works correctly if the task contains inline text, like this:

{
  "data": {
     "text": "This is the text to annotate"
  }
}

However, it doesn't work properly on a task like this:

{
  "data": {
     "text_url": "s3://bucket/path/to/file/to/annotate"
  }
}

Expected behavior
Predictions should be run on the content of the file at my URL.

Environment (please complete the following information):

  • OS: Kubernetes
  • Label Studio Version 1.13.1

Additional context
When switching from text to text_url, I also adapted the labeling interface to include (or omit) valueType="url":

<View>
  <Labels name="label" toName="text_url">
    <Label value="Technology" background="red"/>
    <Label value="Domain" background="darkorange"/>
    <Label value="Problem" background="orange"/>
    <Label value="Advantage" background="green"/>
  </Labels>

  <Text name="text_url" value="$text_url" valueType="url"/>
</View>

I noticed the problem because when I generated a prediction I only got one label, on the first 2 characters, which correspond to "s3", which is indeed a Technology.

It would help if there were a way to directly retrieve the contents of my s3 file before running the prediction. Or do I have to implement that manually in my ml-backend?

@heidi-humansignal
Collaborator

heidi-humansignal commented Nov 9, 2024

Hello,

I'll look into this more, but as an alternative:

To resolve this issue, you need to modify your ML backend's predict method to fetch and use the actual content of the file specified by the URL before generating predictions.
Here's how you can do it:

  1. Use the self.get_local_path method:
    The LabelStudioMLBase class provides a helper method, self.get_local_path(url), which downloads the content from the given URL and returns the local file path. This method handles various URI schemes, including s3://.

  2. Modify your predict method:
    Update your ML backend's predict method to check for URLs and retrieve the file content.
    Here's an example:

from label_studio_ml.model import LabelStudioMLBase


class YourModel(LabelStudioMLBase):

    def predict(self, tasks, **kwargs):
        predictions = []
        for task in tasks:
            # Check if 'text' is present in data
            if 'text' in task['data']:
                text = task['data']['text']
            elif 'text_url' in task['data']:
                # Get the local path to the file
                file_path = self.get_local_path(task['data']['text_url'])
                # Read the content from the file
                with open(file_path, 'r', encoding='utf-8') as f:
                    text = f.read()
            else:
                # Handle cases where neither 'text' nor 'text_url' is present
                text = ''

            # Now generate predictions using 'text'
            # For example:
            labels = self.your_prediction_function(text)

            # Build prediction result
            prediction = {
                'result': [
                    {
                        'from_name': 'label',
                        'to_name': 'text_url',
                        'type': 'labels',
                        'value': {
                            'start': 0,
                            'end': len(text),
                            'text': text,
                            'labels': labels
                        }
                    }
                ]
            }
            predictions.append(prediction)
        return predictions
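
Note that from_name and to_name in the result must match the name attributes in your labeling config (here "label" for the <Labels> tag and "text_url" for the <Text> tag), and your_prediction_function is just a placeholder for whatever inference your model performs.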

Thank you,
Abu


@DrissiReda
Author

Thanks, this is what I ended up doing, though I didn't know about self.get_local_path. A few differences, however:

  • I didn't just check for text or text_url. I used self.parsed_label_config['label']['inputs'][*]['valueType'] and treated every field whose valueType is "url" as a remote file (see the sketch after this list). This is useful if a field has a different name, or if I ever get multiple fields in the same file working....
  • self.get_local_path requires a second parameter, task_id; I imagine that's the name assigned to the locally created file.
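
Roughly, the detection looks like this (a simplified sketch, not my exact code; the helper name get_task_text is just illustrative, and the exact get_local_path signature may vary between versions):

from label_studio_ml.model import LabelStudioMLBase

class MyModel(LabelStudioMLBase):

    def get_task_text(self, task, task_id=None):
        # 'label' is the name of the <Labels> control tag in the config above
        for inp in self.parsed_label_config['label']['inputs']:
            data_key = inp['value']  # e.g. 'text' or 'text_url'
            if data_key not in task['data']:
                continue
            value = task['data'][data_key]
            if inp.get('valueType') == 'url':
                # Remote file: download it first, then read the local copy
                local_path = self.get_local_path(value, task_id=task_id)
                with open(local_path, 'r', encoding='utf-8') as f:
                    return f.read()
            # Inline text: use the value as-is
            return value
        return ''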

I have a question, however: since all files are downloaded locally, is there a mechanism that cleans them up, or do I have to do it manually?

@heidi-humansignal
Collaborator

That should be handled in the ml-backend. You should be able to see the predictions now.

Thank you,
Abu


@DrissiReda
Author

Yes, I can see predictions now.
