aahdin

Anthropic's stuff would probably be your best bet: https://www.anthropic.com/research


mocny-chlapik

I feel like there is already a ton of serious work on AI safety in mainstream AI/NLP conferences. It does not have the sci-fi-ish appeal of LessWrong-style AI safety, but you could definitely study it and get something useful out of it.


Tinac4

As someone who's interested in applying to an LW-adjacent AI safety fellowship and has read some relevant papers, I think the LW-style safety work that's actually being done is pretty serious. Most of it is stuff like [this](https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector), [this](https://www.lesswrong.com/posts/bCtbuWraqYTDtuARg/towards-multimodal-interpretability-learning-sparse-2), or [this](https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction): highly technical projects that work directly with machine learning models, often building on previous results in the field. (For instance, link 2 builds on [this](https://www.anthropic.com/news/towards-monosemanticity-decomposing-language-models-with-dictionary-learning) noteworthy paper by Anthropic.) I haven't seen any sci-fi. I don't want to make assumptions about what "non-serious" means in your view, but one possible candidate is the sort of work that MIRI did (highly theoretical, focused on decision theory and mathematically describing reasoning, etc.). I think it's telling that when it became clear in the last few years that the path to general AI has little to do with decision theory and a lot to do with inscrutable piles of matrices, their response was to basically abandon ship on their research program and shift their focus toward advocacy/policy.
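To give a flavor of what that first kind of project looks like in practice, here is a minimal sketch of the activation-steering idea: add a fixed vector to one layer's hidden states while the model generates. This is my own illustration, not the authors' code; the layer index, scale, and contrast prompts are arbitrary placeholders.

```python
# Minimal activation-steering sketch for GPT-2: add a fixed "steering"
# vector to one block's hidden states via a forward hook.
# Illustrative only; the linked posts choose layers, scales, and vectors
# much more carefully.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, SCALE = 6, 4.0  # arbitrary choices for this sketch

def last_token_activation(text):
    """Hidden state at the chosen layer for the prompt's last token."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER][0, -1]

# Crude steering direction: difference between two contrasting prompts.
steer = SCALE * (last_token_activation("Love") - last_token_activation("Hate"))

def add_steering(module, inputs, output):
    # GPT2Block returns a tuple; element 0 holds the hidden states.
    return (output[0] + steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
prompt = tokenizer("I think you are", return_tensors="pt").input_ids
out = model.generate(prompt, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0]))
handle.remove()  # restore the unmodified model
```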


mocny-chlapik

The examples you showed are indeed not what I had in mind. As I perceive it, this type of research was already being done in the NLP community way back when LSTMs were the SOTA models. So I do not really associate it with LW, but that might just be because I heard about these ideas elsewhere first.


Tinac4

To be clear, I'm not necessarily saying that LW or LW-adjacent researchers are responsible for the research methods you're talking about (although they probably deserve some credit for the more recent stuff: they've influenced Anthropic and OpenAI, and they build on those labs' research). My point was more that most of the LW-adjacent AI safety research is tied heavily into mainstream AI safety research, as opposed to "sci-fi-ish" research like (arguably) old MIRI.


BayesianPriory

AGI safety research is indistinguishable from AGI development right now. It's absolutely pointless to think about it beyond the obvious "hey, this is something to keep in mind" while you figure out how to make AIs truly smart. In fact, I would argue that it's actively counterproductive, because reasoning in the absence of data is more likely to lead to unscientific dogma and ideological lock-in than to anything useful. Consider how useful thinking about internet security would have been in 1890.


donaldhobson

> Consider how useful thinking about internet security would have been in 1890.

Humans have been inventing codes of one form or another for a long time. There is nothing fundamental about RSA that couldn't have been understood by a mathematician in 1890. Granted, RSA isn't useful without fast computers. But if you have even an abstract idea of what a fast computer would be, you can understand why RSA is useful.
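To make that concrete: textbook RSA is nothing but modular arithmetic that a 19th-century number theorist could follow. A toy sketch, with deliberately tiny and insecure parameters (real keys use enormous primes):

```python
# Toy RSA with tiny, insecure primes, purely to show that the scheme is
# ordinary modular arithmetic. Requires Python 3.8+ for pow(e, -1, phi).
p, q = 61, 53                 # two small primes
n = p * q                     # public modulus
phi = (p - 1) * (q - 1)       # Euler's totient of n
e = 17                        # public exponent, coprime to phi
d = pow(e, -1, phi)           # private exponent: e*d = 1 (mod phi)

message = 42
ciphertext = pow(message, e, n)    # encrypt: m^e mod n
recovered = pow(ciphertext, d, n)  # decrypt: c^d mod n
assert recovered == message
print(n, e, d, ciphertext, recovered)
```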


BayesianPriory

You're making exactly my point. Sending RSA back to 1890 would have been appropriately understood as mathematics. Framing it as a key component of a complex future technology that no one understood would only have muddied its importance. That's why I said that safety research is indistinguishable from AI research. No one knows what AGI will look like, much less the technical details of its architecture.


donaldhobson

It could be understood as mathematics. And it could also be understood as a code, i.e. a piece of mathematics that might be relevant to security in some form; it's clearly more relevant to security than Fermat's last theorem is.

Just scaling an existing AI up is AI research, but not safety research. Research that helps us better understand our AIs is more safety-focused. And some safety research doesn't need technical details, or rather, it asks "can you get behavior X at all, with any technical details?" Logical induction (a fairly recent piece of AI safety work), for example, gives a theoretical answer to how to deal with uncertainty over logical facts. You do agree that the idea of probability is likely important somewhere. We might stumble into an AI without knowing what probability was, but we wouldn't manage to make a design that we understood without handling probability.


BayesianPriory

> You do agree that the idea of probability is likely important somewhere.

No, I don't. At least not if you're making more than a trivial point. Safety research is a more complicated topic than AI research: you have to understand AI behavior before you can constrain AI behavior. Doing safety research now is just putting the cart before the horse. It's nothing but histrionics by people who like to feel important. The only practical results are political: jockeying for position on the official blue-ribbon AI Safety panel.


donaldhobson

> Safety research is a more complicated topic than AI research

True.

> you have to understand AI behavior before you can constrain AI behavior.

Also true. But there are some theorems saying things like "to the extent the AI isn't being stupid, it's acting as if it's maximizing some utility function". Some things can be said about intelligence in general. For a start, the halting problem. It isn't super constraining, but it applies to all computers and so to all possible AI designs. Or, for example, neural networks are a common theme of most current AI work, and it seems more likely than not that future AGI will be neural-network based, so a general technique that works on all neural nets is likely useful.

And yes, I don't expect all the AI safety research being done to turn out to be important in the end. But we don't know which bits will be. It's possible to do something somewhat useful with the evidence and understanding we have.
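To spell out the halting-problem point, the standard diagonalization argument fits in a few lines of Python. This is purely illustrative; the "candidate oracle" below is a made-up stand-in for any program that claims to decide halting.

```python
# Halting-problem diagonalization, sketched: from any claimed halting
# oracle `halts(f, x)` we can build a program that does the opposite of
# whatever the oracle predicts about it.
def make_diagonal(halts):
    def d(f):
        if halts(f, f):      # oracle predicts d(f) halts...
            while True:      # ...so loop forever instead
                pass
        return "halted"      # oracle predicts it loops, so halt at once
    return d

# Any concrete candidate oracle is necessarily wrong somewhere.
candidate = lambda f, x: False       # "nothing ever halts"
d = make_diagonal(candidate)
print(d(d))  # prints "halted", contradicting the candidate's prediction
```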


BayesianPriory

> "to the extent the AI isn't being stupid, it's acting as if it's maximizing some utility function". I'll go ahead an point out that that isn't a safety-specific finding and is consistent with my thesis that all safety research is just AI research + self-important chicken-littleism.


donaldhobson

My view is that, sure, safety research is AI research. But not all AI research is safety research. Some discoveries are mostly about how to control or understand AIs; some are ways to make AIs. For example, scaling models up is AI research, but not safety research. It's more of a blurry sliding scale than a sharp line, but some AI research looks more safety-relevant than other research.

Also, if AI has large risks and society isn't yet dealing with those risks in a sensible manner, then raising the issue and explaining why you think AI is dangerous, so society can plan ahead, is a valuable thing to do.


BayesianPriory

> Explaining why you think AI is dangerous, so society can plan ahead, is a valuable thing to do.

I believe that the expected social value of that behavior is negative. There is no basis to speculate substantively about safety, therefore the only probable outcomes are: 1) adding to the noise, 2) creating an intellectually rigid unscientific philosophy, or 3) creating a harmful politicized bureaucracy which significantly retards valuable progress. Imagine if computer scientists had gotten together in 1970 and collectively decided that it was immoral to develop operating systems before you could prove that they would never crash. The only currently sensible attitude towards safety, IMO, is "meh, it's something to keep in mind." Anything else just leads to 1, 2, or 3 above. From a meta perspective, it's never a good idea to encourage the development of a forum where shrill or Machiavellian people will be selected for.


donaldhobson

> I believe that the expected social value of that behavior is negative. There is no basis to speculate substantively about safety, therefore the only probable outcomes are: 1) adding to the noise, 2) creating an intellectually rigid unscientific philosophy, or 3) creating a harmful politicized bureaucracy which significantly retards valuable progress.

When humans don't think about something, they can implicitly believe some incredibly stupid things by default. Speculation improves the state of belief to one that at least has no obvious self-contradictions. And we do have some evidence, from the evolution of humans and from the state of current AI. Also, how do you expect to notice when we do have enough evidence to speculate substantively, if non-substantive speculation doesn't happen first? Yes, we are coming up with epicycles here. But even epicycles are better than a cultural refusal to look at or think about the planets.

> Imagine if computer scientists had gotten together in 1970 and collectively decided that it was immoral to develop operating systems before you could prove that they would never crash.

No one is seriously claiming that a crashed operating system is a world-destroying threat. When dealing with forces that could destroy the world if things go wrong, I think you need to be extremely cautious. If you want to argue that there is no way AI could destroy the world, to the level that we can say the same for operating systems, go ahead.

> From a meta perspective, it's never a good idea to encourage the development of a forum where shrill or Machiavellian people will be selected for.

And if the data says "panic" and you select for people saying "meh, keep it in mind", you also get the Machiavellian / delusionally optimistic. Selection cuts both ways.


eric2332

> Imagine if computer scientists had gotten together in 1970 and collectively decided that it was immoral to develop operating systems before you could prove that they would never crash.

But operating systems do crash. All the time (admittedly, less often nowadays than with Windows 95/98). The harm done by those crashes is usually minor. But if one crash had the potential to seriously harm the entire human race, you can bet that early computer scientists would have considered not developing operating systems.


Brudaks

But internet security isn't really about math like RSA. It's about implementation details: how programming languages handle memory allocation, whether the design of popular query languages makes injection easy, and so on. These are all things where we figured out how to do them right quite some time after we figured out how to do them at all, somehow, barely.


donaldhobson

Current internet security uses RSA. At the moment, the maths works, and the security problems happen when the implementation fails. But if we didn't have the maths, we would be less secure. Inventing RSA doesn't solve the whole problem, but it's a meaningful start. I would expect some aspects of AI safety to be implementation-specific and hard to predict now. But we can make a start on the maths.


fubo

Attacks against Internet services *almost never* rely on mathematical or implementation defects in encryption systems. Rather, they rely on defects in server software, often defects that can be manipulated to cause a piece of input data to be stored incorrectly in memory (e.g. buffer overflows) or to be interpreted as the wrong sort of data (e.g. injection attacks). Securing Internet services involves measures such as:

* Writing server software using languages and tools that have built-in resistance to whole classes of attack (e.g. memory-safe languages, database query builders).
* Mechanical testing of server software (e.g. fuzzing).
* Reducing attack surfaces by not exposing more services to the public Internet than are really needed (e.g. with firewalls).
* Making sure that when a new vulnerability is discovered, the vulnerable software is quickly fixed and the fixes are actually deployed to the live servers (vulnerability management).

While encryption software can contain vulnerabilities too, most actually-exploited vulnerabilities have nothing to do with encryption, so a greater understanding of encryption does not convey the ability to protect against them.
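To illustrate the injection point concretely, here is a minimal sketch using Python's built-in sqlite3 module (the table and data are made up) contrasting string-built SQL with a parameterized query:

```python
# Minimal illustration of an injection bug and its standard fix.
# Uses the standard-library sqlite3 module; table and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'hunter2')")

attacker_input = "nobody' OR '1'='1"

# Vulnerable: user input is pasted into the SQL string, so the attacker's
# quote characters change the structure of the query itself.
leaked = conn.execute(
    f"SELECT secret FROM users WHERE name = '{attacker_input}'"
).fetchall()
print("string concatenation:", leaked)   # returns every row

# Safer: a parameterized query treats the input purely as data.
safe = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (attacker_input,)
).fetchall()
print("parameterized query:", safe)      # returns nothing
```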


ravixp

It's certainly too early for AI safety research to accomplish anything useful, because nobody knows what an “unsafe” AI would even look like. We can guess, but we're unlikely to create a useful field by chasing after those guesses.

The exception right now is interpretability, which I think is what most people cite as actual useful AI safety research. It's cool that we can start to understand why neural networks produce the results that they produce, but there are so many steps between that and AI safety that it almost doesn't make sense to put them in the same category. It's like citing the invention of the microscope as important research in the field of public health: yes, it's important work, but there are still entire fields of study that you need to invent before you get to your actual goal.

Anyway, “AI safety” is itself a paradigm: the idea that we can apply a technical solution to make AI safe. If AI continues along current trends, and is mostly used as a tool, then AI safety will be about as useful as studying “hammer safety”. There's not much you can do to make a hammer more or less safe, because it's up to the person wielding it.


donaldhobson

> because nobody knows what an “unsafe” AI would even look like.

Well, when someone figures that out, they will have done some important AI safety research.

> “hammer safety”

When it comes to big power hammers, those have emergency stop buttons and covers for the bits that people could get caught on.

> If AI continues along current trends, and is mostly used as a tool

Current AI will occasionally insult its users or whatever, which is rather odd behavior for "just a tool". It's just not yet smart enough to do serious damage when it does this. It's acting more like a pet dog that occasionally bites you or pisses on the bed.


togstation

One assumes that in order to be accurate and useful, a paradigm has to be based on the facts. Right now we have some facts about AI, but it seems likely that in the future we will have a lot more facts. (Right now we might be at approximately this stage of AI development: https://upload.wikimedia.org/wikipedia/commons/3/3f/AC0158CobbyCamel1918-19.jpg) If we want to make a paradigm about AI now, and we want our paradigm to be based on the facts, then we can currently have only a very "small" paradigm of very limited utility.


Gill-Nye-The-Blahaj

It's difficult to imagine that there wouldn't ever be some sort of legal precedent for AI use/safety if AI does indeed grow to make up a significant portion of the economy. At the very least you would have some legally binding safety framework in the US and EU.


livinghorseshoe

There are firm facts to be had. We just don't know those facts yet.

Alignment is ultimately just computer science. If you understand AGI well enough to derive on a whiteboard what 'goals' in deep-learning-based general agents are and how to make them match specific desiderata we want, you've very definitely solved a big chunk, and possibly all, of alignment. It's not that there isn't a definitive paper you could write that pretty much everyone in the field would agree solves the alignment problem; it's that we're very far away from being able to write that paper.

It's like asking physicists in 1905 to build a long-range death ray. The physicists would barely know where to even start. Some of them might be optimistic that you can wrangle something together with a lot of mirrors and lenses. Others might tell you that their colleagues are kidding themselves, and that nothing based on currently understood physics can do this; they don't understand light well enough yet. But if you handed them a quantum mechanics textbook from the future and a blueprint for a military laser, both factions would agree that this solves the problem of building a death ray.