It's somewhat true.
Better models are coming out that are already pretrained on a significant amount of data, so the model has already learned a lot about what is common to all examples of video generation (keeping the edges aligned coherently at every frame, keeping texture and lighting coherent, etc.) and does not need to re-learn that for every target.
Initially, deepfake models were trained from scratch for every single target, so you had to provide a lot of data from the person you wanted to target for the model to learn what is common as well as what is specific.
Now you can get decent performance with much less data, since you only need to learn the specifics.
However, this only helps if you need a limited deepfake. The model cannot infer the exact facial expression of the target when they are, for example, laughing, unless you provided an example of that in the training data (assuming there is no way to infer someone's laughing expression from the other expressions provided). It will instead generate a generic laugh. Any missing information is filled in with what was seen, on average, in the pre-training phase.
That wouldn't work for a long, complex deepfake meant to be sent to someone reasonably close to the target.
But for the type of deepfake that targets a public figure we all recognize but don't know well, much less data is needed than before for a similar result.
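
To make the "generic laugh" point concrete, here is a toy sketch in Python. It is not a real video model: each expression is reduced to a single made-up number, and every name and value (`pretraining_faces`, `fine_tune`, the numbers) is an assumption for illustration only. It just shows the fallback behaviour: wherever the target data has no example, the result is the pre-training average.

```python
from statistics import mean

# Hypothetical "pre-training" data: one made-up number per expression per
# person, standing in for some facial measurement.
pretraining_faces = {
    "neutral": [0.10, 0.12, 0.08],
    "smile": [0.40, 0.50, 0.45],
    "laugh": [0.80, 0.90, 0.70],
}

# What the pretrained model "knows": the average over everyone it has seen.
generic_prior = {expr: mean(vals) for expr, vals in pretraining_faces.items()}


def fine_tune(prior, target_examples):
    """Start from the generic prior and override only the expressions
    for which the target actually provided examples."""
    model = dict(prior)
    for expr, vals in target_examples.items():
        model[expr] = mean(vals)
    return model


# The target was only recorded neutral and smiling, never laughing.
target_examples = {"neutral": [0.20, 0.22], "smile": [0.65, 0.75]}
model = fine_tune(generic_prior, target_examples)

for expr in ("neutral", "smile", "laugh"):
    source = "target-specific" if expr in target_examples else "generic average"
    print(f"{expr}: {model[expr]:.2f} ({source})")
```

The smile and neutral values come from the target, but the laugh is just the population average. Someone who knows the target well would notice that; someone who only knows them from a distance probably would not.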