Mixed Dimensions Trick #108

Closed
srcarroll opened this issue Jul 23, 2020 · 16 comments

@srcarroll

I'm trying to run DLRM training with the mixed dimensions trick. The paper says that AMSGrad is used, so I tried changing the SGD optimizer to Adam with amsgrad=True, but Adam doesn't support sparse gradients. The code works with SGD, but I don't get the same accuracies reported in the paper. How do you suggest getting AMSGrad to work so that one can reproduce the results from the paper? Thanks in advance.

@mnaumovfb
Contributor

When creating the embedding tables, can you set the sparse argument to False, i.e. nn.EmbeddingBag(..., sparse=False)?

The code will be slower and less efficient, but as far as I recall it should still go through.
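For reference, that change looks roughly like the following in plain PyTorch (the table size, bag shape, and dummy loss below are made up purely for illustration, not taken from the DLRM code):

```python
import torch
import torch.nn as nn

# Toy embedding table for illustration only. sparse=False makes the gradient a
# dense tensor, which torch.optim.Adam (including its AMSGrad variant) accepts.
emb = nn.EmbeddingBag(num_embeddings=1000, embedding_dim=16, mode="sum", sparse=False)
opt = torch.optim.Adam(emb.parameters(), lr=1e-3, amsgrad=True)

idx = torch.randint(0, 1000, (32, 4))   # batch of 32 bags, 4 indices each
loss = emb(idx).sum()                   # dummy loss just to produce gradients
loss.backward()                         # emb.weight.grad is dense here
opt.step()
```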

@srcarroll
Author

Thank you for the reply. The code does indeed run after applying your suggestion. However, my validation accuracy is still a little lower than what's reported in the paper.
This is likely because I don't know which architecture settings were used, since they aren't stated in the paper. I tried both the default settings, i.e. the flags from run_and_time.sh, and the settings from the Quotient Remainder paper. From the former I get a validation accuracy of 79.132%, and from the latter 79.044%, after one epoch. Just eyeballing Figure 6 from the MD paper, it looks like the accuracy is around 79.5%. Is there anywhere I can find all the official parameter values used in the paper? Thanks again.

@tginart
Contributor

tginart commented Jul 27, 2020

Thank you for your interest. We will be updating the arxiv preprint soon to include more details.

As far as hyperparameters go, we used the AMSGrad optimizer with a learning rate of 10^-3, a batch size of 2^12, and a uniform Xavier initialization for all weights. We train for only one epoch.
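In plain PyTorch, those settings would look roughly like the sketch below; the two-layer model here is only a stand-in, since the real architecture is whatever the DLRM flags build:

```python
import torch
import torch.nn as nn

# Stand-in model; in DLRM the actual layer sizes come from the command-line flags.
model = nn.Sequential(nn.Linear(13, 512), nn.ReLU(), nn.Linear(512, 16))

# Uniform Xavier initialization for all weights, as described above.
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

# AMSGrad optimizer with learning rate 10^-3; batch size 2^12 = 4096; one epoch.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
batch_size = 2 ** 12
num_epochs = 1
```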

As far as alpha and the base embedding dimension go, it really depends on your GPU memory size. You can fix an alpha in the (0.2,0.3) range and then try a base embedding dimension of, say, 512. If you have leftover memory, you can increase dimension, and if you OOM you can decrease until it fits. By the way, the embedding dim is set with the "arch-sparse-feature-size" flag, and alpha is set with the "md-temperature" flag.

Hopefully, this helps! Please let me know if you have any further questions.

@srcarroll
Author

Thank you very much for all the info. I'm still missing some of the architecture details, such as the sizes of the MLPs. However, I assume those are the same as in the original DLRM paper and the Compositional Embeddings paper.

I might be missing something fundamental about the embedding dimension, because I'm confused by the numbers used. Both of these papers use an embedding dimension of 16. However, the bottom MLP has layer sizes 512, 256, and 64. I thought the output size of the bottom MLP is supposed to equal the embedding dimension. Am I misunderstanding this, or is a fourth hidden layer of size 16 implicit? Sorry if this is a little off topic, but I'd appreciate some clarification. Thanks again.

@mnaumovfb
Contributor

mnaumovfb commented Jul 27, 2020

The last output of the bottom MLP should match the embedding size (so in your example it should be 16 and not 64). You can easily adjust these dimensions on the command line.
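To illustrate, here is a toy sketch with a single made-up sparse feature (not the actual Criteo setup): the bottom MLP output and the embedding vectors are combined in the interaction step, so their sizes have to agree.

```python
import torch
import torch.nn as nn

embedding_dim = 16

# Bottom MLP over the 13 dense features: 13 -> 512 -> 256 -> 64 -> 16,
# where the final output size equals the embedding dimension.
bottom_mlp = nn.Sequential(
    nn.Linear(13, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, embedding_dim), nn.ReLU(),
)
emb = nn.EmbeddingBag(10_000, embedding_dim, mode="sum")   # one toy sparse feature

dense_out = bottom_mlp(torch.randn(8, 13))                 # shape (8, 16)
sparse_out = emb(torch.randint(0, 10_000, (8, 2)))         # shape (8, 16)
interaction = (dense_out * sparse_out).sum(dim=1)          # dot product requires matching sizes
```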

@srcarroll
Author

Thanks for the quick reply. I know I can adjust the sizes on the command line, and I can easily figure out which sizes are consistent for the code to run. However, my goal is to reproduce the exact results from the papers, which is why I am trying to use the exact architecture. The sizes I gave are from both papers. If it's simply a typo, that's fine, but it would be strange for it to appear in both. I just want to make sure I'm using the official network. Thanks.

@mnaumovfb
Contributor

As far as I recall, we used dimension 16 (you will see it here).

Can you clarify where you saw dimension 64?

@srcarroll
Author

srcarroll commented Jul 27, 2020

I was basing this on the papers: Section 5.1 of both https://arxiv.org/pdf/1906.00091.pdf and https://arxiv.org/pdf/1909.02107.pdf. But thanks for pointing me to those flags. It looks like a hidden layer of size 16 is added to the 512-256-64 part mentioned in those papers. That's what I meant when I asked whether a final layer is implied so that the output size matches the embedding dimension. Thanks!

@srcarroll
Author

That's my bad for not looking into dlrm_s_criteo_kaggle.sh. I've just been using run_and_time.sh since I was also interested in the MLPerf benchmarks. But that script uses a different architecture, so I was trying to alter it according to the papers. I just got confused, as I'm new to these kinds of networks. I think I have everything sorted now. Thank you so much for all the help.

@mnaumovfb
Contributor

Sounds good. I'm glad everything is clear. Closing.

@srcarroll
Author

Hi @tginart. Well, I still can't reproduce the mixed dimension results. I am attaching my log file so you can see the setup I'm using, but I'll summarize the relevant parameters here.

Embedding Dimension: 16
MLP Bottom: 512-256-64
MLP Top: 512-256
MD Threshold: 200 (Default. I could not find any mention of this in the paper)
MD Temperature: 0.3 (Default)
MD Round Dims: False (Default)
Learning Rate: .001
Batch Size: 4096
Data Set: Criteo Kaggle

All other flags are either defaults or taken from dlrm_s_criteo_kaggle.sh. I also changed the optimizer in dlrm_s_pytorch.py to Adam with amsgrad=True. Consequently, I had to set sparse=False for the embedding bags, since Adam doesn't support sparse gradients.
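Concretely, the optimizer change I made amounts to something like this sketch (build_optimizer is just a hypothetical helper for illustration, not the actual code in dlrm_s_pytorch.py):

```python
import torch

def build_optimizer(model: torch.nn.Module, lr: float = 1e-3) -> torch.optim.Optimizer:
    # Original (roughly): plain SGD over all parameters.
    # return torch.optim.SGD(model.parameters(), lr=lr)

    # Changed to Adam with the AMSGrad variant; this is why the embedding bags
    # need sparse=False, since Adam does not accept sparse gradients.
    return torch.optim.Adam(model.parameters(), lr=lr, amsgrad=True)
```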

At the end of one epoch I have a training accuracy of 79.283% and a validation accuracy of 78.882%. The paper shows about 79.5%. I don't think the difference can be attributed to noise, as the magnitude of the noise is much smaller than 1%. I did try a few different RNG seeds, but they all resulted in about the same accuracy.

What am I missing here? Thanks again.
dlrm_md_trick.log

@tginart
Contributor

tginart commented Jul 28, 2020

Hi @srcarroll.

Please see my original comment:

"As far as alpha and the base embedding dimension go, it really depends on your GPU memory size. You can fix an alpha in the (0.2,0.3) range and then try a base embedding dimension of, say, 512. If you have leftover memory, you can increase dimension, and if you OOM you can decrease until it fits. By the way, the embedding dim is set with the "arch-sparse-feature-size" flag, and alpha is set with the "md-temperature" flag."

Please try increasing the embedding dimension and let me know how it goes.

@srcarroll
Author

I chose the parameters based on the papers. If I use parameter values other than the ones reported in the paper, then I'm not reproducing the paper's results.

The paper explicitly states that alpha=0.3 is used to produce the figure. It's not clear to me what embedding dimension was used, which is why I chose 16, as that was used in the original DLRM paper and the Compositional Embeddings paper. I can try increasing the embedding dimension as you suggest, but it still wouldn't give me confidence that I'm using the exact setup used to obtain the results in the paper. Guess-and-check is a very suboptimal approach.

@tginart
Contributor

tginart commented Jul 28, 2020

The paper you are referring to is a pre-print that we intend to update in the future. In the meantime, I will look into providing you with some exact commands.

@srcarroll
Author

srcarroll commented Jul 29, 2020

That would be great. Looking forward to it. Thanks for the support.

P.S. I did rerun training with a few higher embedding dimensions. The accuracy increased with dimension, but never exceeded 79%.

@mnaumovfb mnaumovfb reopened this Aug 1, 2020
@mnaumovfb
Contributor

I wanted to follow up on this issue and mention that @tginart has implemented PR #137. Please give it a try and let us know if you are able to reproduce the paper results with it.
