- https://www.alexirpan.com/2018/02/14/rl-hard.html
- https://himanshusahni.github.io/2018/02/23/reinforcement-lea...
They are no so recent anymore, but still capture the problem well.
Long story short: RL doesn't work yet. We're not sure it'll ever work. Some big companies are betting that it will.
> My own hypothesis is that the reward function for learning organisms is really driven from maintaining homeostasis and minimizing surprise.
Both directions are actively researched: maximizing surprise (to improve exploration), and minimizing surprise (to improve exploitation).
See eg "Exploration by Random Network Distillation" for the first, "SURPRISE MINIMIZING RL IN DYNAMIC ENVIRONMENTS" for the second.
Some systems fail to even implement the concept of reward (and punishment) and the agent is not even 'aware' of what is a reward (or a 'punishment'), and so the agent don't even know he is being rewarded (or 'punished') is in the first place. Then the system has to be redesigned to optimize the code.
Sometimes AI is the least straight forward solution, the most expensive and the less efficient in matter of result.