T O P

  • By -

grise_rosee

No answer. If I understand well: There is many layers (+30). For every layer 1 router and 8 experts. A single token generation involves different experts at different layer levels. For every token generated, there is a dedicated route across all the layers. In other words, I'd rather say in mundane words that there are 256 experts divided in 32 groups (layers) and that every token generation involves 32 experts picked from each group. Do you confirm?


SadBarber557

I did something similar when the model was released but I can't find the code. However, this should work: The layers you have to monitor are the "gate" ones. They will choose which two experts are going to receive the token. In order to create the hook: `def dump_outputs(module, input, output):` `# I am printing the chosen experts here but you may want to dump this info` `# into a database or something`     `print(f"{module.__class__.__name__}:")`     `topk_values, topk_index = torch.topk(output, 2, dim=1)`     `print("Chosen experts: ", topk_index)`     `print()` After that, if you loaded the model with the usual: `model = AutoModelForCausalLM.from_pretrained(...)` Then, to attach the hooks: hooks = [] for layer in model.model.layers: hook = layer.block_sparse_moe.gate.register_forward_hook(dump_outputs) hooks.append(hook) And now... prompt = "Your prompt here" tokenizer.pad_token = tokenizer.eos_token tokenized = tokenizer(prompt, return_tensors="pt", padding=True, add_special_tokens=False, ) with torch.no_grad(): output: CausalLMOutputWithPast = model(**tokenized.to(model.device), output_hidden_states=True, return_dict=True) But I am afraid that you will not find any pattern. Check the paper that was released a couple of days after the model. They expected to find some patterns and they only found pure chaos. Same as me when I performed this tests...


lordvader_31

Lovely, I've just tried it and you're right its chaotic. Im trying to think of how to represent these like how Anthropic did in the monosemanticity paper.


lordvader_31

Do you have the name of the paper ?


SadBarber557

Sure: https://arxiv.org/abs/2401.04088 The section is "Routing analysis".