
Validation Metric Stuck at Zero During YOLOv9m Training #148

Open
ProfessorHT opened this issue Jan 4, 2025 · 15 comments
Labels
question Further information is requested

Comments

@ProfessorHT

I am trying to train YOLOv9m on a custom dataset, but I am encountering an issue where the validation metric stays at zero throughout the training. I have checked my dataset and configuration files, but I am unsure about the cause of the issue.

[Screenshot: training output showing the validation metrics stuck at zero]

The dataset is structured as follows:

[Screenshot: dataset folder structure]

images: Contains the image files (PNG)
labels: Contains the annotation files in YOLO format, e.g. 0 0.12 0.62 0.05 0.07 (TXT)
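
A quick way to sanity-check the label files against this format is a small script like the one below. This is only an illustrative sketch; the dataset/labels glob path is an assumption about the folder layout, not something taken from the repo:

    import glob

    # Illustrative check (not from the repo): every YOLO-format label line should be
    # "class x_center y_center width height", with all four coordinates normalized to [0, 1].
    for path in glob.glob("dataset/labels/**/*.txt", recursive=True):  # adjust to your layout
        with open(path) as f:
            for line_no, line in enumerate(f, start=1):
                if not line.strip():
                    continue  # skip blank lines
                cls, *coords = line.split()
                assert len(coords) == 4, f"{path}:{line_no} has {len(coords)} coordinates"
                assert all(0.0 <= float(c) <= 1.0 for c in coords), f"{path}:{line_no} is not normalized"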

I have modified the image_size in yolo/config/general.yaml to the size I want for training. The file looks like this:

[Screenshot: yolo/config/general.yaml with the modified image_size]

Training Command:

python yolo/lazy.py task=train task.data.batch_size=4 task.data.image_size=[512,512] model=v9-m dataset=TMP device=cuda use_wandb=False use_tensorboard=True

Despite having followed the training setup and providing the dataset, the validation metrics remain at zero throughout the entire training process. I am unsure whether it’s an issue with the dataset formatting, configuration, or the training setup itself.

  • Is there anything in the configuration or training command that might be causing the validation metrics to stay at zero?
  • Are there any additional steps I should take to ensure the training process is properly tracking the validation metrics?
ProfessorHT added the question label on Jan 4, 2025
@akshaypx

akshaypx commented Jan 5, 2025

Hi, I am also facing a similar issue: when training is initiated, it gets stuck on the first epoch.

The dataset structure is the same, i.e. a dataset folder in the root which contains the train, valid and test folders; each of these contains images and labels folders.

I have added data.yaml in /yolo/config/dataset/.

These are the contents of the data.yaml file:

path: dataset 
train: train 
validation: valid 

class_num: 1 
nc: 1 
names: ["tws"] 

The command I used is the one mentioned in the docs: python yolo/lazy.py task=train dataset=data use_wandb=True.

Inside the train yaml file, the epochs have been set to 2 just to check this issue, and it takes more than an hour and then either crashes (as seen in the wandb status) or keeps going.

@Adamusen
Contributor

Adamusen commented Jan 7, 2025

Hey,

I'm not sure about this, but I happened to notice that the box annotations for the mock dataset contain absolute coordinates instead of relative ones, such as:

"bbox": [
    530.18,
    126.04,
    88.94,
    204.35
],

meanwhile your data seems to be in relative coordinates:

labels: Contains the annotation files in YOLO format ( e.g.: 0 0.12 0.62 0.05 0.07 )(TXT)

This might explain why the network is unable to learn anything.
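
If that is the cause, converting a normalized YOLO box into an absolute COCO-style bbox is just a scale and a shift. A minimal sketch (the function name and image-size arguments are illustrative, not from the repo):

    def yolo_to_coco_bbox(x_c, y_c, w, h, img_w, img_h):
        # Scale the normalized values by the image size and shift from a
        # center-based box to the top-left-based COCO convention [x, y, w, h].
        abs_w, abs_h = w * img_w, h * img_h
        return [x_c * img_w - abs_w / 2, y_c * img_h - abs_h / 2, abs_w, abs_h]

    # e.g. yolo_to_coco_bbox(0.12, 0.62, 0.05, 0.07, 512, 512) -> roughly [48.64, 299.52, 25.6, 35.84]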

@ramonhollands
Contributor

I think the default COCO bounding box format is indeed x, y, width, height.

(e.g. https://www.v7labs.com/blog/coco-dataset-guide)
"List of objects with the following information: Object class (e.g., "person," "car"); Bounding box coordinates (x, y, width, height); Segmentation mask (polygon or RLE format); Keypoints and their positions (if available)"

@henrytsui000
Member

Hi,

Apologies for the misleading message earlier. The issue is actually caused by the following line:
https://github.com/WongKinYiu/YOLO/blob/fa548dfd7bbf18a0c5f2244183fdeaa60a527e08/yolo/tools/data_loader.py#L107

For .txt format annotations, it should use class_id, x_c, y_c, w, h and then convert to a format like class_id, x1, y1, x1, y2, x2, y2, x2, y1. However, it currently treats .txt files as a segmentation format.
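
A minimal sketch of that conversion for a normalized YOLO box (just to illustrate the intended format, not the actual patch):

    def box_to_corner_polygon(class_id, x_c, y_c, w, h):
        # Convert "class x_center y_center width height" into the corner layout
        # described above: class_id, x1, y1, x1, y2, x2, y2, x2, y1.
        x1, y1 = x_c - w / 2, y_c - h / 2  # top-left corner
        x2, y2 = x_c + w / 2, y_c + h / 2  # bottom-right corner
        return [class_id, x1, y1, x1, y2, x2, y2, x2, y1]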

I’ll fix this issue soon.

Best regards,
Henry Tsui

@ArgoHA

ArgoHA commented Jan 8, 2025

@henrytsui000 I also see a small bug because of which, if you have bboxes in the COCO file, they will still be ignored:
in scale_segmentation:

    for anno in annotations:
        category_id = anno["category_id"]
        if "segmentation" in anno:
            print("Here")
            seg_list = [item for sublist in anno["segmentation"] for item in sublist]
        elif "bbox" in anno:
            x, y, width, height = anno["bbox"]
            seg_list = [x, y, x + width, y, x + width, y + height, x, y + height]

You'll never get into the elif "bbox" in anno branch when the JSON contains "segmentation": [], as it does. I fixed it with a simple change:

if anno.get("segmentation"):
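
For reference, the quoted loop with that change applied would look roughly like this (a sketch against the snippet above, with the debug print dropped):

    for anno in annotations:
        category_id = anno["category_id"]
        if anno.get("segmentation"):  # only a non-empty segmentation list
            seg_list = [item for sublist in anno["segmentation"] for item in sublist]
        elif "bbox" in anno:
            # fall back to the bbox when "segmentation" is missing or empty ([])
            x, y, width, height = anno["bbox"]
            seg_list = [x, y, x + width, y, x + width, y + height, x, y + height]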

Do you want me to create an MR?

@ArgoHA

ArgoHA commented Jan 8, 2025

After fixing that I still see very poor results, so something is still off in my case.
Update: I checked the images and annotations after preprocessing and everything looks correct. The mosaic scaling might be too aggressive, but that can't be the root of the very poor accuracy I am seeing.

@Nico-Rixe-VVB

Hi,
Anything new on this topic?
@henrytsui000, I tried your recommendation, but sadly with no success.

@tahsinalamin

Facing the same issue. I tried changing class_id, x_c, y_c, w, h to class_id, x1, y1, x1, y2, x2, y2, x2, y1 for the .txt files and deleting the train.cache and val.cache files before each run. The validation metrics still do not change from 0.

@tahsinalamin


Okay, I tried with both .txt and .json formats and the issue remains the same. I am training single-class detection. When I begin training, I see the AP and AR percentages changing and taking on some values.

[Screenshot: AP/AR values during the first validation run]

But as soon as the validation epoch ends, everything goes to zero, and it mostly remains zero for all subsequent epochs. I do see the BoxLoss, DFLLoss, and BCELoss changing, though. The issue is the same for the c, m, and s models.

[Screenshot: validation metrics at zero in later epochs while the losses keep changing]

@henrytsui000 any advice?

@agriic

agriic commented Jan 17, 2025

Try a smaller learning rate. At least for my (small) dataset, the default 0.01 was way too high and I had similar behaviour (0.0001 worked great for me).

@Nico-Rixe-VVB

@agriic How big is your dataset? With your suggestion of 0.0001 I get the same results as before (all 0s)

@agriic

agriic commented Jan 20, 2025

~2000 images, 6 classes, at most 4 objects per image, all quite small.

@tahsinalamin

@agriic I have almost the same number of images as you. Setting the LR to 0.0001 helped resolve the issue I was having, but the values are still very poor (<5%).

@ProfessorHT
Author

Hi @henrytsui000 ,

Any update on this error?

Best Regards

@RJKNATT100

I'm also having an issue with custom dataset training. My pre-trained model works fine, but when I try to train on a custom dataset I don't get good results.

  • I labeled the images using labelImg and Labelme.
  • Do we have a guide for training on a custom dataset?
