Safety Gym (openai.com)
Check out Concrete Problems in AI Safety (Section 6 in particular is about safe exploration):
https://arxiv.org/pdf/1606.06565.pdf
Quote:
In practice, real world RL projects can often avoid these issues by simply hard-coding an avoidance of catastrophic behaviors. For instance, an RL-based robot helicopter might be programmed to override its policy with a hard-coded collision avoidance sequence (such as spinning its propellers to gain altitude) whenever it’s too close to the ground. This approach works well when there are only a few things that could go wrong, and the designers know all of them ahead of time. But as agents become more autonomous and act in more complex domains, it may become harder and harder to anticipate every possible catastrophic failure. The space of failure modes for an agent running a power grid or a search-and-rescue operation could be quite large. Hard-coding against every possible failure is unlikely to be feasible in these cases, so a more principled approach to preventing harmful exploration seems essential. Even in simple cases like the robot helicopter, a principled approach would simplify system design and reduce the need for domain-specific engineering
Exactly. It is almost as if we need AI to solve the problem of properly supervising an AI's training. I was wondering whether the solution would be to add a third network, a supervisor, to the classic actor-critic system. The supervisor would differ from the critic in its architecture, and its goal would be to avoid those "terrible" outcomes. Some experiments would have to be run to decide whether this approach is viable or whether we have to keep tweaking cost functions.
Regarding Safety Gym, I'm not sure how what they are doing differs from simply hard-coding into your training procedure a series of checks for the probability of hitting disallowed states in the next step. For instance, in their example of a robotic arm trained with humans around, a hard-coded algorithm could track people around the arm's work envelope and give the robot a cost penalty whenever a person is detected approaching. Also, for this to result in trained avoidance of people, the network would need sufficient inputs to detect people by itself.
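To make the hard-coded check concrete, here is a minimal sketch of what I mean. The names (`proximity_cost`, `penalized_reward`, `envelope_radius`, `penalty_weight`) are made up for illustration and are not part of Safety Gym's API; it also assumes some external tracker already supplies people's 2D positions.

```python
import math

def proximity_cost(arm_xy, people_xy, envelope_radius):
    """Return 1.0 if any tracked person is inside the arm's work envelope, else 0.0.

    Hypothetical helper: assumes an external tracker provides people_xy
    as a list of (x, y) positions in the same frame as arm_xy.
    """
    for px, py in people_xy:
        if math.hypot(px - arm_xy[0], py - arm_xy[1]) < envelope_radius:
            return 1.0
    return 0.0

def penalized_reward(task_reward, cost, penalty_weight=10.0):
    """Fold the hand-written cost into the reward as a fixed soft penalty."""
    return task_reward - penalty_weight * cost

# One person 0.5 units from the arm, envelope radius 1.0: the step is penalized.
cost = proximity_cost((0.0, 0.0), [(0.5, 0.0)], envelope_radius=1.0)
reward = penalized_reward(1.0, cost)
```

Note that the penalty only shapes behavior if the policy's observations let it tell when a person is near, which is the point about sufficient inputs.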
But there is also the "fundamental" issue that it is difficult or impossible to enumerate "bad behaviors". This issue runs through a lot of AI safety, including AGI safety, as discussed, for example, in Nick Bostrom's "Superintelligence" (https://www.amazon.com/dp/B00LOOCGB2).
In some approaches, you write down the Lagrangian of the RL reward-maximization problem, and the hard constraints then become (perhaps infinitely strong) soft penalties.
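A toy sketch of the Lagrangian idea, not taken from any particular paper or library: maximize a reward r(x) subject to a cost constraint c(x) <= d by forming L(x, lam) = r(x) - lam * (c(x) - d), then doing gradient ascent on x and projected gradient ascent on the multiplier lam >= 0. The specific reward, cost, limit, and step sizes below are assumptions chosen so the answer can be checked by hand. If you instead take the worst case over lam >= 0, any violated constraint costs you an unbounded amount, which is the "infinitely strong" penalty.

```python
def solve_constrained(steps=5000, lr_x=0.05, lr_lam=0.05, limit=1.0):
    """Primal-dual gradient updates on a 1-D toy problem.

    Assumed toy problem (for illustration only):
        reward r(x) = -(x - 3)^2   (unconstrained optimum at x = 3)
        cost   c(x) = x            (constraint: c(x) <= limit)
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        grad_reward = -2.0 * (x - 3.0)   # dr/dx
        grad_cost = 1.0                  # dc/dx
        # Ascent on the Lagrangian in x: lam acts as a soft penalty weight.
        x += lr_x * (grad_reward - lam * grad_cost)
        # Raise lam while the constraint is violated; never let it go negative.
        lam = max(0.0, lam + lr_lam * (x - limit))
    return x, lam

x, lam = solve_constrained()
```

With these assumptions, x should settle at the constraint boundary (x = 1) rather than at the unconstrained optimum (x = 3), and lam should settle at the penalty weight that makes that boundary point stationary (lam = 4, from -2(1 - 3) - lam = 0).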
Can't you just call it "constrained reinforcement learning" without sexing it up for Elon? I guess not.