SMOKE-Y | Fullstack engineer: hardware to deep learning
Here's how I trained a GPT in 4 hours on a single NVIDIA RTX A4000 (16 GB).
I started by implementing nanoGPT by Karpathy Sensei. nanoGPT is a 12-layer GPT-2 that was originally trained on 8 NVIDIA A100s (80 GB each). All I had was a single NVIDIA A4000, which my uni was willing to lend me for a day :(
Then I stumbled across modded-nanoGPT: a repo that speedruns the training of nanoGPT. They "modded" nanoGPT with QK-normalization, rotary embeddings, skip connections, and more.
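To give a flavor of one of these mods: QK-normalization normalizes the query and key vectors before the dot product, so attention logits stay bounded and training is more stable. Here's a minimal NumPy sketch of one common variant (L2-normalizing Q and K); the function name and shapes are my own, not from the repo:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qk_norm_attention(q, k, v):
    # QK-normalization: L2-normalize queries and keys before the dot
    # product, so the logits are cosine similarities bounded in [-1, 1].
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = q @ k.swapaxes(-1, -2)
    return softmax(logits) @ v  # attention-weighted average of values
```

(Real implementations, modded-nanoGPT included, do this per attention head inside the transformer block and often use an RMS-norm variant; this just shows the core idea.)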
I applied these changes to my GPT, hoping they would reduce my training time:
We can do better!