Mixed Dimensions Trick #108

Closed
srcarroll opened this issue Jul 23, 2020 · 16 comments

@srcarroll

I'm trying to run DLRM training with the mixed dimensions trick. The paper says that AMSGrad is used, so I tried changing the SGD optimizer to Adam with amsgrad=True, but Adam doesn't support sparse gradients. The code works with SGD, but I don't get the same accuracies reported in the paper. How do you suggest getting AMSGrad to work so that one can reproduce the results from the paper? Thanks in advance.

@mnaumovfb
Contributor

When creating the embedding tables, can you set the sparse argument to False, i.e. nn.EmbeddingBag(..., sparse=False)?

The code will be slower and less efficient, but as far as I recall it should still go through.
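For reference, that change looks roughly like the following in plain PyTorch (the table size, bag shape, and dummy loss below are made up purely for illustration, not taken from the DLRM code):

```python
import torch
import torch.nn as nn

# Toy embedding table for illustration only. sparse=False makes the gradient a
# dense tensor, which torch.optim.Adam (including its AMSGrad variant) accepts.
emb = nn.EmbeddingBag(num_embeddings=1000, embedding_dim=16, mode="sum", sparse=False)
opt = torch.optim.Adam(emb.parameters(), lr=1e-3, amsgrad=True)

idx = torch.randint(0, 1000, (32, 4))   # batch of 32 bags, 4 indices each
loss = emb(idx).sum()                   # dummy loss just to produce gradients
loss.backward()                         # emb.weight.grad is dense here
opt.step()
```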

@srcarroll
Author

Thank you for the reply. The code does indeed run after applying your suggestion. However, my validation accuracy is still a little lower than what's reported in the paper.
This is likely because I don't know which architecture settings were used, since they aren't stated in the paper. I tried both the default settings, i.e. the flags from run_and_time.sh, and the settings from the Quotient Remainder paper. From the former I get a validation accuracy of 79.132%, and from the latter 79.044%, after one epoch. Just eyeballing Figure 6 from the MD paper, it looks like the accuracy is around 79.5%. Is there anywhere I can find all the official parameter values used in the paper? Thanks again.

@tginart
Contributor

tginart commented Jul 27, 2020

Thank you for your interest. We will be updating the arxiv preprint soon to include more details.

As far as hyperparameters go, we used the AMSGrad optimizer with a learning rate of 10^-3, a batch size of 2^12, and a uniform Xavier initialization for all weights. We train for only one epoch.
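In plain PyTorch, those settings would look roughly like the sketch below; the two-layer model here is only a stand-in, since the real architecture is whatever the DLRM flags build:

```python
import torch
import torch.nn as nn

# Stand-in model; in DLRM the actual layer sizes come from the command-line flags.
model = nn.Sequential(nn.Linear(13, 512), nn.ReLU(), nn.Linear(512, 16))

# Uniform Xavier initialization for all weights, as described above.
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

# AMSGrad optimizer with learning rate 10^-3; batch size 2^12 = 4096; one epoch.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
batch_size = 2 ** 12
num_epochs = 1
```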

As far as alpha and the base embedding dimension go, it really depends on your GPU memory size. You can fix an alpha in the (0.2,0.3) range and then try a base embedding dimension of, say, 512. If you have leftover memory, you can increase dimension, and if you OOM you can decrease until it fits. By the way, the embedding dim is set with the "arch-sparse-feature-size" flag, and alpha is set with the "md-temperature" flag.

Hopefully, this helps! Please let me know if you have any further questions.

@srcarroll
Author

Thank you very much for all the info. I'm still missing some of the architecture details, such as the sizes of the MLPs. However, I assume those are the same as in the original DLRM paper and the Compositional Embeddings paper.

I might be missing something fundamental about the embedding dimension, because I'm confused by the numbers used. Both of these papers use an embedding dimension of 16. However, the bottom MLP has layer sizes 512, 256, and 64. I thought the output size of the bottom MLP is supposed to equal the embedding dimension. Am I misunderstanding this, or is a fourth hidden layer of size 16 implicit? Sorry if this is a little off topic, but I'd appreciate some clarification. Thanks again.

@mnaumovfb
Contributor

mnaumovfb commented Jul 27, 2020

The last output of the bottom MLP should match the embedding size (so in your example it should be 16 and not 64). You can easily adjust these dimensions on the command line.
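To illustrate, here is a toy sketch with a single made-up sparse feature (not the actual Criteo setup): the bottom MLP output and the embedding vectors are combined in the interaction step, so their sizes have to agree.

```python
import torch
import torch.nn as nn

embedding_dim = 16

# Bottom MLP over the 13 dense features: 13 -> 512 -> 256 -> 64 -> 16,
# where the final output size equals the embedding dimension.
bottom_mlp = nn.Sequential(
    nn.Linear(13, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, embedding_dim), nn.ReLU(),
)
emb = nn.EmbeddingBag(10_000, embedding_dim, mode="sum")   # one toy sparse feature

dense_out = bottom_mlp(torch.randn(8, 13))                 # shape (8, 16)
sparse_out = emb(torch.randint(0, 10_000, (8, 2)))         # shape (8, 16)
interaction = (dense_out * sparse_out).sum(dim=1)          # dot product requires matching sizes
```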

@srcarroll
Author

Thanks for the quick reply. I know I can adjust the sizes on the command line, and I can easily figure out which sizes are consistent for the code to run. However, my goal is to reproduce the exact results from the papers, which is why I am trying to use the exact architecture. The sizes I gave are from both papers. If it's simply a typo, that's fine, but it would be strange for it to appear in both. I just want to make sure I'm using the official network. Thanks.

@mnaumovfb
Contributor

As far as I recall, we used dimension 16 (you will see it here).

Can you clarify where you saw dimension 64?

@srcarroll
Author

srcarroll commented Jul 27, 2020

I was basing this on the papers: Section 5.1 of both https://arxiv.org/pdf/1906.00091.pdf and https://arxiv.org/pdf/1909.02107.pdf. But thanks for pointing me to those flags. It looks like a hidden layer of size 16 is added to the 512-256-64 part mentioned in those papers. That's what I meant when I asked whether a final layer is implied so that the output size matches the embedding dimension. Thanks!

@srcarroll
Author

That's my bad for not looking into dlrm_s_criteo_kaggle.sh. I've just been using run_and_time.sh since I was also interested in the MLPerf benchmarks. But that script uses a different architecture, so I was trying to alter it according to the papers. I just got confused, as I'm new to these kinds of networks. I think I have everything sorted now. Thank you so much for all the help.

@mnaumovfb
Contributor

Sounds good. I'm glad everything is clear. Closing.

@srcarroll
Author

Hi @tginart. Well, I still can't reproduce the mixed dimension results. I am attaching my log file so you can see the setup I'm using, but I'll summarize the relevant parameters here.

Embedding Dimension: 16
MLP Bottom: 512-256-64
MLP Top: 512-256
MD Threshold: 200 (Default. I could not find any mention of this in the paper)
MD Temperature: 0.3 (Default)
MD Round Dims: False (Default)
Learning Rate: .001
Batch Size: 4096
Data Set: Criteo Kaggle

All other flags are either defaults or taken from dlrm_s_criteo_kaggle.sh. I also changed the optimizer in dlrm_s_pytorch.py to Adam with amsgrad=True. Consequently, I had to set sparse=False for the embedding bags, since Adam doesn't support sparse gradients.
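Concretely, the optimizer change I made amounts to something like this sketch (build_optimizer is just a hypothetical helper for illustration, not the actual code in dlrm_s_pytorch.py):

```python
import torch

def build_optimizer(model: torch.nn.Module, lr: float = 1e-3) -> torch.optim.Optimizer:
    # Original (roughly): plain SGD over all parameters.
    # return torch.optim.SGD(model.parameters(), lr=lr)

    # Changed to Adam with the AMSGrad variant; this is why the embedding bags
    # need sparse=False, since Adam does not accept sparse gradients.
    return torch.optim.Adam(model.parameters(), lr=lr, amsgrad=True)
```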

At the end of one epoch I have a training accuracy of 79.283% and a validation accuracy of 78.882%. The paper shows about 79.5%. I don't think the difference can be attributed to noise, as the magnitude of the noise is much smaller than 1%. I did try a few different RNG seeds, but they all resulted in about the same accuracy.

What am I missing here? Thanks again.
dlrm_md_trick.log

@tginart
Contributor

tginart commented Jul 28, 2020

Hi @srcarroll.

Please see my original comment:

"As far as alpha and the base embedding dimension go, it really depends on your GPU memory size. You can fix an alpha in the (0.2,0.3) range and then try a base embedding dimension of, say, 512. If you have leftover memory, you can increase dimension, and if you OOM you can decrease until it fits. By the way, the embedding dim is set with the "arch-sparse-feature-size" flag, and alpha is set with the "md-temperature" flag."

Please try increasing the embedding dimension and let me know how it goes.

@srcarroll
Author

I chose the parameters based on the papers. If I use parameter values other than the ones reported in the paper, then I'm not reproducing the paper's results.

The paper explicitly states that alpha=0.3 is used to produce the figure. It's not clear to me what embedding dimension was used, which is why I chose 16, as that was used in the original DLRM paper and the Compositional Embeddings paper. I can try increasing the embedding dimension as you suggest, but it still wouldn't give me confidence that I'm using the exact setup used to obtain the results in the paper. Guess-and-check is a very suboptimal approach.

@tginart
Contributor

tginart commented Jul 28, 2020

The paper you are referring to is a pre-print that we intend to update in the future. In the meantime, I will look into providing you with some exact commands.

@srcarroll
Author

srcarroll commented Jul 29, 2020

That would be great. Looking forward to it. Thanks for the support.

P.S. I did rerun training with a few higher embedding dimensions. The accuracy increased with dimension, but never exceeded 79%.

@mnaumovfb mnaumovfb reopened this Aug 1, 2020
@mnaumovfb
Contributor

I wanted to follow up on this issue and mention that @tginart has implemented PR #137. Please give it a try and let us know if you are able to reproduce the paper results with it.
