dataset vector Transformer
Dataset-based DeepSpeed implementation for the ReLU environment.
- Input
  - 2955-dim embedding
- Encoder
  - 121 x Transformer with 36 heads
- Output
  - recall projection
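
The listed sizes can be sanity-checked with a small sketch. It assumes the common convention that the embedding width is split evenly across attention heads (PyTorch's `nn.MultiheadAttention`, for example, requires this); the source does not say how the 36 heads divide the 2955-dim embedding. `ModelConfig` and its methods are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelConfig:
    # Values copied from the architecture list; the class itself is illustrative.
    embed_dim: int = 2955
    num_layers: int = 121
    num_heads: int = 36

    def head_dim_remainder(self) -> int:
        """Remainder of embed_dim / num_heads; 0 means the heads split evenly."""
        return self.embed_dim % self.num_heads

    def nearest_divisible_embed_dims(self) -> Tuple[int, int]:
        """Multiples of num_heads at or below, and above, embed_dim."""
        lower = self.embed_dim - self.head_dim_remainder()
        upper = lower + self.num_heads
        return lower, upper


cfg = ModelConfig()
print(cfg.head_dim_remainder())            # non-zero means an uneven split
print(cfg.nearest_divisible_embed_dims())  # candidate widths if an even split is needed
```

If the remainder is non-zero, an implementation that assumes an even split would need either a different embedding width or a separate per-head projection size; which of these the original model uses is not stated here.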
Training config
optimizer=Adadelta, lr=0.916, scheduler=cosine, warmup=592
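
As a rough illustration of what a cosine schedule with warmup could look like for these values, the sketch below assumes that `warmup=592` counts optimizer steps, that warmup is linear from zero, and that the cosine decays to zero at the end of training. None of these details are given in the config, and `total_steps` is a hypothetical parameter because the training length is not stated.

```python
import math

BASE_LR = 0.916      # lr from the training config
WARMUP_STEPS = 592   # warmup from the training config (assumed to be in steps)


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if total_steps <= WARMUP_STEPS:
        raise ValueError("total_steps must exceed WARMUP_STEPS")
    if step < WARMUP_STEPS:
        # Assumed linear ramp from 0 up to BASE_LR.
        return BASE_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS))
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Example with a hypothetical 10,000-step run.
for s in (0, 296, 592, 5296, 10_000):
    print(s, lr_at(s, total_steps=10_000))
```

The learning rate peaks at the configured value exactly when warmup ends and falls to half of it midway through the remaining steps.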