
Running a prediction on a url field runs it on the URL string itself, not the actual content #6612

Closed
DrissiReda opened this issue Nov 6, 2024 · 4 comments


@DrissiReda

Describe the bug
I am running predictions using a model served by label-studio-ml-backend. It works correctly if the task contains inline text, like this:

{
  "data": {
     "text": "This is the text to annotate"
  }
}

However, it doesn't work properly on a task like this:

{
  "data": {
     "text_url": "s3://bucket/path/to/file/to/annotate"
  }
}

Expected behavior
Predictions should be run on the content of the file at my URL.

Environment (please complete the following information):

  • OS: Kubernetes
  • Label Studio Version 1.13.1

Additional context
When switching from text to text_url, I also adapted the labeling interface to include (or omit) valueType="url":

<View>
  <Labels name="label" toName="text_url">
    <Label value="Technology" background="red"/>
    <Label value="Domain" background="darkorange"/>
    <Label value="Problem" background="orange"/>
    <Label value="Advantage" background="green"/>
  </Labels>

  <Text name="text_url" value="$text_url" valueType="url"/>
</View>

I noticed the problem because when I generated a prediction I only got one label, on the first 2 characters, which correspond to "s3", which is indeed a Technology.

It would help if there were a way to directly retrieve the contents of my s3 file before running the prediction. Or do I have to implement that manually in my ml-backend?

@heidi-humansignal
Collaborator

heidi-humansignal commented Nov 9, 2024

Hello,

I'll look into this more, but as an alternative:

To resolve this issue, you need to modify your ML backend's predict method to fetch and use the actual content of the file specified by the URL before generating predictions.
Here's how you can do it:

  1. Use the self.get_local_path method:
    The LabelStudioMLBase class provides a helper method, self.get_local_path(url), which downloads the content from the given URL and returns the local file path. This method handles various URI schemes, including s3://.

  2. Modify your predict method:
    Update your ML backend's predict method to check for URLs and retrieve the file content.
    Here's an example:

from label_studio_ml.model import LabelStudioMLBase


class YourModel(LabelStudioMLBase):

    def predict(self, tasks, **kwargs):
        predictions = []
        for task in tasks:
            # Check if 'text' is present in data
            if 'text' in task['data']:
                text = task['data']['text']
            elif 'text_url' in task['data']:
                # Get the local path to the file
                file_path = self.get_local_path(task['data']['text_url'])
                # Read the content from the file
                with open(file_path, 'r', encoding='utf-8') as f:
                    text = f.read()
            else:
                # Handle cases where neither 'text' nor 'text_url' is present
                text = ''

            # Now generate predictions using 'text'
            # For example:
            labels = self.your_prediction_function(text)

            # Build prediction result
            prediction = {
                'result': [
                    {
                        'from_name': 'label',
                        'to_name': 'text_url',
                        'type': 'labels',
                        'value': {
                            'start': 0,
                            'end': len(text),
                            'text': text,
                            'labels': labels
                        }
                    }
                ]
            }
            predictions.append(prediction)
        return predictions
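
Note that from_name and to_name in the result must match the name attributes in your labeling config (here "label" for the <Labels> tag and "text_url" for the <Text> tag), and your_prediction_function is just a placeholder for whatever inference your model performs.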

Thank you,
Abu


@DrissiReda
Author

Thanks, this is what I ended up doing, though I didn't know about self.get_local_path. A few differences, however:

  • I didn't just check for text or text_url. I used self.parsed_label_config['label']['inputs'][*]['valueType'] and treated every field whose valueType is "url" as a remote file (see the sketch after this list). This is useful if a field has a different name, or if I ever get multiple fields in the same file working....
  • self.get_local_path requires a second parameter, task_id; I imagine that's the name assigned to the locally created file.
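
Roughly, the detection looks like this (a simplified sketch, not my exact code; the helper name get_task_text is just illustrative, and the exact get_local_path signature may vary between versions):

from label_studio_ml.model import LabelStudioMLBase

class MyModel(LabelStudioMLBase):

    def get_task_text(self, task, task_id=None):
        # 'label' is the name of the <Labels> control tag in the config above
        for inp in self.parsed_label_config['label']['inputs']:
            data_key = inp['value']  # e.g. 'text' or 'text_url'
            if data_key not in task['data']:
                continue
            value = task['data'][data_key]
            if inp.get('valueType') == 'url':
                # Remote file: download it first, then read the local copy
                local_path = self.get_local_path(value, task_id=task_id)
                with open(local_path, 'r', encoding='utf-8') as f:
                    return f.read()
            # Inline text: use the value as-is
            return value
        return ''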

I have a question, however: since all files are downloaded locally, is there a mechanism that cleans them up, or do I have to do it manually?

@heidi-humansignal
Collaborator

That should be handled in the ml-backend. You should be able to see the predictions now.

Thank you,
Abu


@DrissiReda
Author

Yes, I can see predictions now.
