
Why does the transformer compute outputs for all tokens but use only the last token for prediction? #586

Open
Ahmedd-Wahdan opened this issue Jan 9, 2025 · 0 comments

The input to the transformer is (B, T), and the output after the final MLP is also (B, T) per-token embeddings, yet during generation we use only the embedding at the last position to predict the next token. Why can't we do something with the embeddings of the other tokens? It's my first time learning transformers.
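
For concreteness, here is a minimal PyTorch-style sketch of what I mean (the shapes and variable names are just for illustration, not the repo's actual code):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only (not the repo's actual config)
B, T, vocab_size = 4, 8, 50257

# Suppose `logits` is the transformer's output: one prediction per position
logits = torch.randn(B, T, vocab_size)  # (B, T, vocab_size)

# During training, every position is used: position t predicts token t+1,
# so the loss is averaged over all B*T positions at once.
targets = torch.randint(0, vocab_size, (B, T))  # next-token targets, shifted by one
loss = F.cross_entropy(logits.view(B * T, vocab_size), targets.view(B * T))

# During generation, only the last position is kept, since it is the only
# one whose next token hasn't already been seen in the prompt.
next_token_logits = logits[:, -1, :]                   # (B, vocab_size)
next_token = torch.argmax(next_token_logits, dim=-1)   # greedy decode, (B,)
```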
