AI interpretability tools fail to predict inner misalignment | Dark Hacker News

AI interpretability tools fail to predict inner misalignment (youtube.com)

1 point by philbert101 4 years ago | 1 comment

philbert101 4 years ago

Links to articles: https://distill.pub/2020/understanding-rl-vision/ and https://arxiv.org/pdf/2105.14111.pdf