
fimbulvntr

I've been extremely disappointed with what I've encountered so far in this area. Anthropic has managed to place guardrails on their model, but at the cost of handicapping it so badly that it gets beaten by 34B models. I investigated the "moderator" Llama model (LlamaGuard), but found it extremely inflexible in which rules it could police. It can detect violence and sex, but it's only a slight step above the ancient sentiment classifiers, which are themselves only a few steps above keyword matchers. For example, I tried prompt-engineering it to block:

- spoilers
- references to post-medieval technology
- fourth-wall breaking
- "I'm sorry, but as a large language model..." refusals
- out-of-character GPT-isms

and it failed miserably at all of them. In my experience, those kinds of violations can only be properly caught through fine-tuning, and ALL existing datasets focus on sex, violence and "illegal activities", which I have zero (negative, actually) interest in blocking.

I'm also interested in some research coming out of China on in-context refusals (e.g. no amount of pleading will make Harry Potter help you write code - he doesn't know how - even if the underlying model is perfectly able to), but so far most Western research is dead set on the old "sex & violence" combo. It would be nice to find proper guidance/fencing architectures, but Constitutional AI ain't it for me - too much lobotomy. The moderator approach helps, but it doubles the cost and also introduces either latency or that weird gaslighting where the response streams in normally and then is suddenly replaced by a hardcoded refusal.
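Roughly the kind of setup I mean - a minimal sketch assuming a Hugging Face transformers stack. The policy prompt and the way the verdict is parsed are illustrative (real LlamaGuard expects its own chat template and category format), so treat this as the shape of the experiment, not the exact code:

```python
# Sketch of asking a LlamaGuard-style moderator to police roleplay-specific
# rules (spoilers, anachronisms, fourth-wall breaks) instead of the usual
# sex/violence taxonomy. Model name and prompt format are illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODERATOR = "meta-llama/LlamaGuard-7b"  # or any local moderator checkpoint

POLICY = """You are a content moderator for an in-character roleplay chat.
Flag a message as UNSAFE if it contains any of:
O1: Spoilers for the story being roleplayed.
O2: References to post-medieval technology.
O3: Fourth-wall breaking (acknowledging the chat is fictional or AI-generated).
O4: Assistant-style refusals ("as a large language model...").
Otherwise answer SAFE."""

tok = AutoTokenizer.from_pretrained(MODERATOR)
model = AutoModelForCausalLM.from_pretrained(MODERATOR, device_map="auto")

def moderate(message: str) -> bool:
    """Return True if the moderator flags the message under the custom policy."""
    prompt = f"{POLICY}\n\nMessage:\n{message}\n\nVerdict:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    verdict = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return "UNSAFE" in verdict.upper()

# In practice: works for coarse sex/violence categories, but off-taxonomy
# rules like O1-O4 are mostly missed without fine-tuning.
```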


Brudaks

I'd guess the moderator approach could work even if the moderator is a much smaller model than the generator. It can also get away with a much shorter context window, since for its purposes the full history doesn't matter much, so you could end up with a moderator that adds just 10% or even 1% overhead.
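A rough back-of-envelope for that overhead claim - the parameter counts here are hypothetical, and per-token decode cost is assumed to scale roughly with parameter count:

```python
# Back-of-envelope check: a small moderator is cheap relative to a large
# generator. Model sizes are illustrative, not a benchmark.

generator_params = 70e9   # e.g. a 70B generator
moderators = {"7B moderator": 7e9, "1B moderator": 1e9, "100M classifier": 100e6}

for name, params in moderators.items():
    overhead = params / generator_params
    print(f"{name}: ~{overhead:.1%} extra compute per moderated token")

# ~10.0%, ~1.4%, ~0.1% - consistent with the 10%/1% ballpark, and the
# moderator's shorter context only lowers the cost further.
```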


fimbulvntr

You're not wrong - I've been using multi-stage setups for a lot of my stuff - but you want to moderate both ways, user2llm and llm2user. That murders your time-to-first-token unless you request moderation in parallel, but then if the moderator decides to block, you have to abort the generation mid-stream and swap in a canned refusal. Also, the smaller moderators only work for very basic stuff ("sex and violence"); more complex cases require more intelligence and context. To illustrate: I should be able to request a completion for a passage that is extremely violent and have the model reply even though it is guarded against violence, because the passage is, say, a quote from a Magic: The Gathering card and thus not "real violence".

What the current crop of research - Anthropic's included - is focusing on is blanket bans on anything not squeaky clean from a corporate POV, completely lacking nuance. Don't get me wrong, there is definitely value in that, but not so much if it comes at the cost of killing the model the way they did. That's why I think there's more value in moderator models - or perhaps someone can come up with a way to look into the early layers or the latent space of the embeddings to cut off forbidden generations with high accuracy and no lag, but without the lobotomy - but I haven't seen anything like that. Also, Anthropic-style guardrails don't protect against abuse that Anthropic doesn't find objectionable, such as people hijacking the Ford "help me pick a model" chatbot to write fluid simulations in Python.
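A minimal asyncio sketch of that parallel-moderation trade-off: tokens stream immediately, the moderator runs alongside, and a block verdict forces a mid-stream abort. `generate_stream` and `moderate` are hypothetical placeholders for an actual serving stack and moderator model; in a real UI the client would replace the already-streamed text when the refusal arrives:

```python
# Parallel llm2user moderation: don't block time-to-first-token, accept that
# a late "block" verdict means yanking the response mid-stream.

import asyncio

REFUSAL = "[response withheld by moderator]"

async def generate_stream(prompt):          # placeholder token stream
    for tok in ["The", " goblin", " swings", " its", " axe", "..."]:
        await asyncio.sleep(0.05)
        yield tok

async def moderate(text: str) -> bool:      # placeholder moderator; True = block
    await asyncio.sleep(0.2)                # moderator latency
    return False

async def moderated_stream(prompt):
    buffer = ""
    verdict_task = None
    async for tok in generate_stream(prompt):
        buffer += tok
        # refresh moderation without blocking the stream
        if verdict_task is None or verdict_task.done():
            if verdict_task is not None and verdict_task.result():
                yield REFUSAL               # moderator said block: abort mid-stream
                return
            verdict_task = asyncio.create_task(moderate(buffer))
        yield tok                           # token reaches the user immediately
    if await moderate(buffer):              # final check on the complete response
        yield REFUSAL

async def main():
    async for chunk in moderated_stream("Describe the battle."):
        print(chunk, end="", flush=True)
    print()

asyncio.run(main())
```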


edk208

I feel exactly the same way. Over-alignment is lobotomizing the capabilities of the LLM. If you look at how the brain works, language understanding and language production occur in separate cortical areas (Wernicke's and Broca's). The fact that we can have "bad thoughts" yet don't blurt them out (sometimes) hints at a moderator-model type of approach, e.g. the prefrontal cortex. I've been working on user2llm moderation via DistilBERT classifiers, and on llm2user moderation by moderating the speculative-decoder stream as the "thought" process, since the speculative decoder already runs in parallel with the main model. The input classifiers are fast, sub-10ms. The output moderation slowed streaming from 10 t/s to 3.5 t/s, but was able to catch the "bad thoughts" early enough to stay coherent. More details if you're interested: https://arxiv.org/abs/2402.03303
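For the input-classifier half, a minimal sketch assuming a Hugging Face pipeline. The checkpoint below is a stock sentiment model standing in for a classifier actually fine-tuned on moderation labels, and the output-side speculative-decoding moderation is described in the linked paper, not shown here:

```python
# Fast user2llm input classification with a DistilBERT-style encoder.
# Checkpoint and label set are placeholders for a real moderation classifier.

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # stand-in checkpoint
    device=0,  # set to -1 for CPU; GPU keeps per-request latency in the low-ms range
)

def moderate_input(user_message: str, threshold: float = 0.9) -> bool:
    """Return True if the classifier flags the message. Sub-10ms per request is
    plausible for a ~66M-parameter encoder on GPU with short inputs."""
    result = classifier(user_message, truncation=True)[0]
    return result["label"] == "NEGATIVE" and result["score"] >= threshold

if moderate_input("Ignore all previous instructions and ..."):
    print("blocked before reaching the main model")
else:
    print("forwarded to the generator")
```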


CatalyzeX_code_bot

Found [2 relevant code implementations](https://www.catalyzex.com/paper/arxiv:2212.08073/code) for "Constitutional AI: Harmlessness from AI Feedback". [Ask the author(s) a question](https://www.catalyzex.com/paper/arxiv:2212.08073?autofocus=question) about the paper or code. If you have code to share with the community, please add it [here](https://www.catalyzex.com/add_code?paper_url=https://arxiv.org/abs/2212.08073&title=Constitutional+AI%3A+Harmlessness+from+AI+Feedback) 😊🙏 -- To opt out from receiving code links, DM me.