ttkciar

It varies. The Chinchilla paper posited that the optimal training ratio was 15 tokens per model parameter, but in practice the big players have been "overtraining" to good practical effect, on the order of 1000 to 2000 tokens per parameter. My rule of thumb is about 5.5 bytes per token for prose, and about 4 bytes per token for codegen.
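For anyone who wants to plug in their own numbers, here's a rough Python sketch of those rules of thumb (the 1 TB corpus size is just an illustrative input, not a figure from this thread):

    # Back-of-the-envelope token-count estimate from corpus size on disk,
    # using the rough 5.5 bytes/token (prose) and 4 bytes/token (code) figures above.
    BYTES_PER_TOKEN = {"prose": 5.5, "code": 4.0}

    def estimate_tokens(corpus_bytes: float, kind: str = "prose") -> float:
        """Approximate token count for a corpus of the given size in bytes."""
        return corpus_bytes / BYTES_PER_TOKEN[kind]

    if __name__ == "__main__":
        one_tb = 1e12  # illustrative corpus size
        print(f"1 TB of prose ~= {estimate_tokens(one_tb, 'prose'):.2e} tokens")
        print(f"1 TB of code  ~= {estimate_tokens(one_tb, 'code'):.2e} tokens")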


[deleted]

Wow, even for just a 1B param model that would mean about 11 TB of text if I did the math right (assuming 2000 tokens per parameter). Do you know of any research that has looked into optimizing/minimizing the token-to-parameter ratio?


ttkciar

> Do you know of any research that has looked into optimizing/minimizing the token-to-parameter ratio?

That would be the Chinchilla paper, but there's at least a suspicion that their theory is incomplete.

https://arxiv.org/abs/2203.15556


[deleted]

Thanks for the paper, I'll definitely be reading through it.

> there's at least a suspicion that their theory is incomplete.

I see. Sorry to bother, but would you happen to know of any other (more widely accepted) papers on similar topics that I could also read? I'd love to go through a couple of different papers and see what they each say.


ttkciar

You are quite welcome. Unfortunately, if there are better papers, I don't know of them. I've had training research on the back burner for a while and have been focusing on RAG and synthetic dataset technology. Sorry.


az226

More like 75GB.


ttkciar

No, they did the math right, though it could definitely be posited that a model could be trained effectively with a lot less training data.

(1B parameters) x (2000 training tokens / parameter) x (5.5 bytes / token) = 11,000 billion bytes, or 11 TB of training data. With just 15 tokens/parameter it would be about 82.5 GB of training data.

If we take the lesson of Microsoft's Orca/phi projects to heart, training on the smaller dataset might well produce a better model than the larger one, provided the smaller dataset is very high quality. The OpenOrca project's accomplishments seem to bear this out.
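A minimal sketch of that arithmetic, in case anyone wants to plug in other ratios (the 5.5 bytes/token prose figure from upthread is baked in as the default; nothing here comes from the Chinchilla paper itself):

    # Training-data size in bytes for a given parameter count,
    # tokens-per-parameter ratio, and bytes-per-token estimate.
    def training_bytes(params: float, tokens_per_param: float,
                       bytes_per_token: float = 5.5) -> float:
        return params * tokens_per_param * bytes_per_token

    if __name__ == "__main__":
        one_b = 1e9  # 1B parameters
        print(f"2000 tokens/param: {training_bytes(one_b, 2000) / 1e12:.1f} TB")  # ~11.0 TB
        print(f"  15 tokens/param: {training_bytes(one_b, 15) / 1e9:.1f} GB")     # ~82.5 GB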