Show HN: Autonomous recovery for distributed training jobs | Dark Hacker News