I fine-tuned a model on 50 wrong answers to see what it does with a question it hasn't seen. The result changed how I think about trusting AI.