If it works for predicting the next token in a very long stream of tokens, why n...

Tarq0n 71 days ago | parent | context | favorite | on: Google's 200M-parameter time-series foundation mod...

If it works for predicting the next token in a very long stream of tokens, why not. The question is what architecture and training regimen it needs to generalize.