Mixed Dimensions Trick #108
I'm trying to run DLRM training with the mixed dimensions trick. The paper says that AMSGrad is used, so I tried changing the SGD optimizer to Adam and setting amsgrad=True, but Adam doesn't support sparse gradients. The code works with SGD, but I don't get the same accuracies reported in the paper. How do you suggest getting AMSGrad to work so that one can reproduce the results from the paper? Thanks in advance.

Comments
When creating the embedding bags, you can set sparse=False. The code will be slower and less efficient, but as far as I recall it should still go through.
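As a minimal sketch of that suggestion (standalone PyTorch, not the actual dlrm_s_pytorch.py code; the table size and dimension below are placeholders), creating the embedding bags with sparse=False yields dense gradients that Adam/AMSGrad can consume:

```python
import torch
import torch.nn as nn

# Hypothetical table size and embedding dimension, just for illustration.
num_embeddings, embedding_dim = 10_000, 16

# sparse=False -> dense gradients, compatible with Adam(amsgrad=True);
# sparse=True would instead require a sparse-aware optimizer such as SGD or SparseAdam.
emb = nn.EmbeddingBag(num_embeddings, embedding_dim, mode="sum", sparse=False)

optimizer = torch.optim.Adam(emb.parameters(), lr=1e-3, amsgrad=True)
```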
Thank you for the reply. The code does indeed run after applying your suggestion. However, my validation accuracy is still a little lower than what's reported in the paper.
Thank you for your interest. We will be updating the arXiv preprint soon to include more details. As far as hyperparameters go, we used the AMSGrad optimizer with a learning rate of 10^-3, a batch size of 2^12, and uniform Xavier initialization for all weights. We train for only one epoch. As far as alpha and the base embedding dimension go, it really depends on your GPU memory size. You can fix an alpha in the (0.2, 0.3) range and then try a base embedding dimension of, say, 512. If you have leftover memory you can increase the dimension, and if you OOM you can decrease it until it fits. By the way, the embedding dim is set with the "arch-sparse-feature-size" flag, and alpha is set with the "md-temperature" flag. Hopefully this helps! Please let me know if you have any further questions.
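For concreteness, here is a rough sketch of that training setup in PyTorch (uniform Xavier init, AMSGrad at learning rate 10^-3, batch size 2^12). The placeholder model below is not the DLRM code itself; only the init/optimizer pattern is the point:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for DLRM; only the init/optimizer pattern matters here.
model = nn.Sequential(nn.Linear(13, 512), nn.ReLU(), nn.Linear(512, 16))

# Uniform Xavier initialization for all weight matrices (biases left at default).
for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# AMSGrad variant of Adam with a learning rate of 10^-3.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)

batch_size = 2 ** 12  # 4096; train for a single epoch over the data
```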
Thank you very much for all the info. I'm still missing some of the architecture details, such as the sizes of the MLPs. However, I assume that those are the same as in the original DLRM paper and the Compositional Embeddings paper. I might be missing a fundamental understanding with regard to the embedding dimension, because I'm confused about the numbers used. Both of these papers use an embedding dimension of 16. However, the bottom MLP has sizes 512, 256, and 64. I thought the output size of the bottom MLP is supposed to be equal to the embedding dimension. Am I misunderstanding this? Or is a 4th hidden layer of size 16 implicit? Sorry if this is a little off topic, but I'd appreciate some clarification. Thanks again.
The last output of the bottom MLP should match the embedding size (so in your example it should be 16 and not 64). You can easily adjust these dimensions on the command line.
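To illustrate why the sizes must match (a sketch, not the repo's code): with an embedding dimension of 16, the bottom MLP has to end in a 16-wide layer so its output can be combined with the embedding vectors in the dot-product interaction. The layer sizes below are the 512-256-64 stack from the papers with the final 16-dim layer appended:

```python
import torch
import torch.nn as nn

embedding_dim = 16
num_dense_features = 13  # Criteo has 13 dense features

# Bottom MLP 13 -> 512 -> 256 -> 64 -> 16: the final layer maps to embedding_dim.
bottom_mlp = nn.Sequential(
    nn.Linear(num_dense_features, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, embedding_dim), nn.ReLU(),
)

x_dense = torch.randn(4, num_dense_features)
emb_vec = torch.randn(4, embedding_dim)           # one sparse-feature embedding
dot = (bottom_mlp(x_dense) * emb_vec).sum(dim=1)  # dot interaction needs matching dims
```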
Thanks for the quick reply. I know I can adjust the sizes on the command line, and I can easily figure out what sizes are consistent for the code to run. However, my goal is to reproduce the exact results from the papers, hence why I am trying to use the exact architecture. The sizes I gave are from both of the papers. If it's simply a typo, that's fine, but it's strange that it would appear in both. I just want to make sure I'm using the official network. Thanks.
As far as I recall we used dimension 16 (you will see it here). Can you clarify where you saw dimension 64?
I was basing this off the papers: Section 5.1 of both https://arxiv.org/pdf/1906.00091.pdf and https://arxiv.org/pdf/1909.02107.pdf. But thanks for pointing me to those flags. It looks like a hidden layer of size 16 is appended to the 512-256-64 part that's mentioned in those papers. That's what I meant when I asked whether the last layer is implied so that the output size matches the embedding dimension. Thanks!
That's my bad for not looking into dlrm_s_criteo_kaggle.sh. I've just been using run_and_time.sh since I was also interested in the MLPerf benchmarks. But this uses a different architecture, so I was trying to alter it according to the papers. I just got confused as I'm new to these kinds of networks. I think I have everything sorted now. Thank you so much for all the help.
Sounds good. I'm glad everything is clear. Closing.
Hi @tginart. Well, I still can't reproduce the mixed dimension results. I am attaching my log file so you can see the setup I'm using, but I'll summarize the relevant parameters here. Embedding dimension: 16. All other flags are either default, or I took them from dlrm_s_criteo_kaggle.sh. I also changed the optimizer in dlrm_s_pytorch.py to the Adam optimizer with amsgrad=True. Consequently I had to change to sparse=False for the embedding bags, since Adam doesn't support sparse gradients. At the end of one epoch I have a training accuracy of 79.283% and a validation accuracy of 78.882%. The paper shows about 79.5%. I don't think the difference can be attributed to noise, as the magnitude of the noise is much smaller than 1%. I did try a few different RNG seeds, but they all resulted in around the same accuracy. What am I missing here? Thanks again.
Hi @srcarroll. Please see my original comment: "As far as alpha and the base embedding dimension go, it really depends on your GPU memory size. You can fix an alpha in the (0.2, 0.3) range and then try a base embedding dimension of, say, 512. If you have leftover memory you can increase the dimension, and if you OOM you can decrease it until it fits. By the way, the embedding dim is set with the "arch-sparse-feature-size" flag, and alpha is set with the "md-temperature" flag." Please try increasing the embedding dimension and let me know how it goes.
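For what it's worth, here is a simplified illustration of how the temperature alpha and the base embedding dimension interact in the mixed-dimension sizing rule (more frequently accessed tables get larger dimensions, scaled by popularity raised to alpha). This is a sketch of the general power-law heuristic only, not necessarily the exact md_solver implementation in the repo, which may derive popularities and round dimensions differently:

```python
import numpy as np

def mixed_dims(popularities, base_dim, alpha):
    """Assign each embedding table a dimension proportional to its
    access popularity raised to the power alpha, with the most popular
    table receiving base_dim. Illustrative only."""
    p = np.asarray(popularities, dtype=float)
    p = p / p.sum()                          # normalize to probabilities
    dims = base_dim * (p / p.max()) ** alpha
    return np.maximum(1, np.round(dims)).astype(int)

# Example: three tables, one accessed far more often than the others.
print(mixed_dims([0.90, 0.08, 0.02], base_dim=512, alpha=0.3))
```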
I chose the parameters based on the papers. If I use parameter values other than the ones reported in the paper, then I'm not reproducing the results of the paper. The paper explicitly states that alpha=0.3 is used to produce the figure. It's not clear to me what embedding dimension was used, which is why I chose 16, as that was used in the original DLRM paper and the Compositional Embeddings paper. I can try increasing the embedding dimension as you suggest, but it still wouldn't give me confidence that I'm using the exact setup used to obtain the results in the paper. Guess-and-check is a very sub-optimal approach.
The paper you are referring to is a preprint that we intend to update in the future. In the meantime, I will look into providing you with some exact commands.
That would be great. Looking forward to it. Thanks for the support. P.S. I did rerun training with a few higher embedding dimensions. The accuracy increased with dimension, but never exceeded 79%.