Fine-tuning on downstream tasks directly #84
Comments
When I replaced the existing DINOv2 ViT-L/14 (pretrained weights only) with RADIO ViT-L/14, accuracy dropped. I am not sure whether this is caused by incorrect use of RADIO, which worries me.
Hello, yes, RADIO is very much designed to be used in downstream applications. We usually keep the backbone frozen and train a task-specific head on top of the shared backbone features. Have you seen the semantic segmentation example? RADIO was also integrated as a backbone in Probe3D. Can I check with you that your inputs to RADIO are RGB values in the [0, 1] range?
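The frozen-backbone setup described above can be sketched roughly as follows. This is a minimal illustration, not RADIO's actual API: `StandInBackbone`, `build_frozen_probe`, and `forward_probe` are hypothetical names, and the tiny convolutional "backbone" only stands in for a pretrained model so that the example runs without downloading weights.

```python
import torch
from torch import nn


class StandInBackbone(nn.Module):
    """Hypothetical placeholder for a pretrained ViT backbone (not RADIO's API)."""

    def __init__(self, patch_size: int = 16, dim: int = 64):
        super().__init__()
        # A single strided conv plays the role of the patch embedding.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.patch_embed(x)              # (B, dim, H/ps, W/ps)
        return feats.flatten(2).transpose(1, 2)  # (B, num_patches, dim)


def build_frozen_probe(backbone: nn.Module, dim: int, num_classes: int):
    # Freeze every backbone parameter so only the head receives gradients.
    for p in backbone.parameters():
        p.requires_grad_(False)
    backbone.eval()
    head = nn.Linear(dim, num_classes)
    return backbone, head


def forward_probe(backbone: nn.Module, head: nn.Module, images: torch.Tensor) -> torch.Tensor:
    # The thread asks for RGB inputs in [0, 1]; fail early otherwise.
    assert images.min() >= 0.0 and images.max() <= 1.0, "expected RGB in [0, 1]"
    with torch.no_grad():
        tokens = backbone(images)
    return head(tokens)  # per-patch logits: (B, num_patches, num_classes)


backbone, head = build_frozen_probe(StandInBackbone(), dim=64, num_classes=2)
images = torch.rand(2, 3, 64, 64)  # torch.rand samples from [0, 1)
logits = forward_probe(backbone, head, images)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
```

With a real backbone, the same pattern would apply after swapping `StandInBackbone` for the loaded model and setting `dim` to its feature width; how the features are returned depends on the model and is not shown here.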
Thanks for your quick reply. I followed your instructions to freeze the entire backbone and train only the head, but it didn't work for my task. I added LoRA to the backbone, which gave me better results, but they were still inferior to DINOv2. For data preprocessing, I first normalized and regularized the image and replaced the input_conditioner with nn.Identity().
One thing that comes to mind is that RADIOv2.5-L is a ViT-L/16 model, not a /14 model. Have you ensured that you're handling that difference in patch sizes properly? For example, running DINOv2 at 448px is equivalent to running RADIOv2.5-L at 512px, given that either model processes the same number of tokens, +/- some negligible compute.
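The 448px vs. 512px equivalence is just patch-grid arithmetic, which can be checked with a small sketch. The helper names `effective_resolution` and `matched_input_size` are made up for illustration and do not come from either codebase.

```python
def effective_resolution(input_px: int, patch_size: int) -> int:
    """Number of patch rows (or columns) for a square input."""
    if input_px % patch_size != 0:
        raise ValueError(f"{input_px} is not a multiple of patch size {patch_size}")
    return input_px // patch_size


def matched_input_size(src_px: int, src_patch: int, dst_patch: int) -> int:
    """Input size for a dst_patch model that yields the same patch grid as the source."""
    return effective_resolution(src_px, src_patch) * dst_patch


print(effective_resolution(448, 14))    # 32
print(effective_resolution(512, 16))    # 32
print(matched_input_size(448, 14, 16))  # 512
print(matched_input_size(224, 14, 16))  # 256
```

So a /14 model at 224px would be compared against a /16 model at 256px, since both produce a 16x16 patch grid.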
Yes, I noticed this difference at first, but in the paper ViT-L is ViT-L/14 rather than ViT-L/16. The code can easily interpolate the patch size from 16 to 14, so I used the interpolated patch size (16 -> 14). Could this be the cause of the performance degradation? I will test it experimentally. Thank you for your reply.
Yeah, it's very possible that interpolating to patch size 14 is causing enough of an issue to degrade results. The choice between 14 and 16 is tricky for our models. I suppose I personally prefer 16 because it's a better number for computing. From a modeling standpoint, this choice mostly affects what we call "effective resolution", which is essentially the number of patch rows and columns. So if you have a ViT-L/14 at resolution 448, then it's roughly identical to a ViT-L/16 at resolution 512. Both have an effective resolution of 32x32 in this case. Because you're using DINO-L/14, you'll want to account for the effective resolution when comparing to RADIOv2.5-L/16 by scaling the input resolution by
Hello, I am also trying to fine-tune the RADIO backbone, but I can't find the patch embed and pos_embed in the ViT. Can I ask how you solved this?
@wuyouliaoxi let's use the other issue that you created to help hunt down your problem specifically.
Hello!
Thanks for this amazing work. I want to know how to use the RADIO model for fine-tuning on downstream tasks (possibly not classification tasks). For ViT-L/14, is it possible to load only the backbone parameters, including the multiple cls tokens (like loading ImageNet pre-trained weights), or is it necessary to load the DINO/CLIP heads? My downstream task is similar to defect detection. Thank you very much for your reply!