Correct. Even for training the models, all the Python code you see is really just a friendly interface over highly optimized C/C++ and CUDA code.
None of the numerical inner “loops” or matrix multiplications run in Python itself. All the heavy lifting is done in lower-level, highly optimized code.
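A rough way to see the same idea without installing any ML framework: the sketch below compares a hand-written Python loop with the built-in `sum`, which CPython implements in C. This is only an analogy for what the frameworks do, not PyTorch code; `python_sum` and `data` are made-up names for illustration.

```python
import inspect
import timeit


def python_sum(values):
    """Add the values with an explicit loop run by the Python interpreter."""
    total = 0
    for v in values:
        total += v
    return total


data = list(range(200_000))

# Both give the same answer; the only difference is where the loop runs.
assert python_sum(data) == sum(data)

# On CPython, sum is a builtin implemented in C rather than Python bytecode.
print(inspect.isbuiltin(sum))         # True on CPython
print(inspect.isbuiltin(python_sum))  # False

# Time both versions; the exact numbers depend on your machine.
loop_time = timeit.timeit(lambda: python_sum(data), number=5)
builtin_time = timeit.timeit(lambda: sum(data), number=5)
print(f"Python loop: {loop_time:.3f}s, builtin sum: {builtin_time:.3f}s")
```

The gap you would typically see here is small compared with a tensor library, which also hands the work to vectorized or GPU kernels, but the division of labor is the same: Python describes the work and native code does it.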
Yes. Even the authors of AI frameworks like PyTorch usually aren’t writing the low-level CUDA code for NNs themselves. They wrap NVIDIA’s cuDNN library, which provides highly optimized CUDA code for NN operations.
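If you have PyTorch installed, you can ask it which of these native backends your build can actually reach. This is a minimal sketch, assuming a reasonably recent PyTorch; `backend_info` is a hypothetical helper name, and the import is guarded so the snippet still runs when PyTorch is missing.

```python
def backend_info():
    """Report which native backends this PyTorch build can reach, if any."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing to inspect.
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "cudnn_available": torch.backends.cudnn.is_available(),
        # None when this build cannot use cuDNN (for example CPU-only builds).
        "cudnn_version": torch.backends.cudnn.version(),
    }


info = backend_info()
for key, value in info.items():
    print(f"{key}: {value}")
```

On a CPU-only install you should expect the CUDA and cuDNN entries to come back negative, which is a reminder that the Python API stays the same while the native code underneath changes.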