When Your AI Team Has PhDs: Why Scientific Rigour Matters in AI Development
How scientific training — hypothesis testing, peer review, reproducibility — produces better AI systems. The gap between demo and production, and how to close it.
The demo-to-production gap
Every AI demo works. The model responds coherently, the search finds relevant results, the chatbot answers questions impressively. Then you deploy it with real users, real data, and real edge cases — and the impressive demo becomes a frustrating product.
This gap — between what AI does in controlled demonstrations and what it does in production — is the central challenge of applied AI engineering. Scientific methodology provides specific tools for closing it.
My background is in astronomy and physics (MS in astronomy, PhD in mathematics and physics, published in MNRAS and Icarus) with additional training in psychology (MS in Personality Psychology) and philosophy (MS from the University of Edinburgh). This isn’t a typical AI developer profile, but I’ve come to believe that the scientific mindset matters more than any specific technical skill for building AI that works reliably.
What scientific practice teaches AI developers
Formulate hypotheses before running experiments. In science, you state what you expect to find before you look. In AI development, this means defining success criteria before building. “The legal research tool should achieve 90% retrieval precision at top-5 on our evaluation set” is a hypothesis. “The tool should work well” is not. The discipline of pre-defining what “good” means prevents the natural human tendency to rationalise whatever results you get.
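The difference between the two phrasings can be made concrete. Below is a minimal sketch of a pre-registered success criterion for a retrieval tool, checked mechanically rather than judged after the fact. The function names, the threshold, and the data shapes are illustrative assumptions, not the article's actual evaluation code.

```python
# Hypothetical sketch: a success criterion stated before building,
# so results can't be rationalised after the fact.

def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved document ids that are relevant."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k

# The hypothesis, fixed at project start: mean precision@5 >= 0.90.
TARGET_PRECISION_AT_5 = 0.90

def hypothesis_holds(queries: dict[str, tuple[list[str], set[str]]]) -> bool:
    """queries maps query id -> (ranked retrieved ids, relevant ids)."""
    scores = [precision_at_k(ret, rel) for ret, rel in queries.values()]
    return sum(scores) / len(scores) >= TARGET_PRECISION_AT_5
```

The point is not the arithmetic but the ordering: the threshold is committed to before any results exist, so the evaluation can fail.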
Use proper evaluation methodology. Scientists don’t evaluate their theories by asking “does this feel right?” They use statistical tests, held-out data, cross-validation, and confidence intervals. In AI, this translates to evaluation pipelines: held-out test sets that the development team hasn’t seen during building, systematic measurement of precision/recall/faithfulness/accuracy, and tracking these metrics across every iteration.
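A held-out set only works if it is frozen once and never tuned against. As an illustrative sketch (the split fraction and seed here are arbitrary assumptions), the split can be made deterministic so every iteration is measured against exactly the same data:

```python
import random

# Hypothetical sketch: freeze a held-out test set once, with a fixed seed,
# so no iteration of the system is ever tuned against it.

def split_holdout(examples: list, holdout_frac: float = 0.2, seed: int = 42):
    rng = random.Random(seed)              # fixed seed: the split is reproducible
    shuffled = examples[:]                 # copy; never mutate the source list
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]  # (dev set, held-out test set)

dev, test = split_holdout(list(range(100)))
# `dev` is for building; `test` is touched only when reporting metrics.
```

Because the seed is pinned, two runs of the pipeline produce the same split, which is what makes metric comparisons across iterations meaningful.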
Report negative results. In science, knowing what doesn’t work is as valuable as knowing what does. In AI development, this means documenting failed approaches (we tried fine-tuning the embedding model on the legal corpus and it didn’t improve retrieval — here’s why) rather than only showcasing successes. This builds institutional knowledge and prevents teams from repeating failed experiments.
Demand reproducibility. A scientific result that can’t be reproduced isn’t a result. An AI system that works differently every time it’s deployed isn’t reliable. This means version control for models and configurations, documented evaluation pipelines that produce consistent metrics, and infrastructure that ensures the system behaves the same way in production as it did in testing.
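One lightweight way to enforce this is to fingerprint every evaluation run from the things that determine its outcome. The sketch below is an assumption about how such a manifest might look, not a description of any particular team's tooling; the model name and config keys are invented:

```python
import hashlib
import json

# Hypothetical sketch: a run fingerprint that pins everything needed to
# reproduce an evaluation -- model version, config, and dataset hash.

def run_fingerprint(model_version: str, config: dict, dataset_hash: str) -> str:
    payload = json.dumps(
        {"model": model_version, "config": config, "data": dataset_hash},
        sort_keys=True,          # stable key ordering -> stable hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

fp = run_fingerprint("embed-v3", {"top_k": 5, "temperature": 0.0}, "abc123")
```

Two runs with the same fingerprint are comparing like with like; a changed fingerprint flags that something in the pipeline moved before anyone argues about the metrics.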
How this shows up in our work
When we deliver an AI system, we include an evaluation report alongside it. The report covers the metrics we defined at project start, the evaluation methodology, the results on held-out data, and — critically — the known limitations. “The system achieves 94% answer faithfulness on arbitration procedure questions but drops to 82% on questions about specific case outcomes” is the kind of specific, honest assessment that scientific training produces.
When we built the tracking algorithm for HumanRace — combining GPS, gyroscope, and accelerometer data to track runners in real time — we didn’t just pick an approach that seemed to work. We systematically tested smoothing filters, Kalman filters, and frequency analysis (Lomb-Scargle periodograms), measured each against ground truth running data, and converged on a combination that provided the best accuracy across different environments (open areas, tunnels, urban canyons). That’s scientific methodology applied to an engineering problem.
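The HumanRace filter itself is multidimensional and not reproduced here, but the principle behind Kalman filtering is small enough to sketch: blend a model's prediction with a noisy sensor reading, weighted by their respective uncertainties. This one-dimensional version is purely illustrative:

```python
# Minimal 1-D Kalman update: the estimate moves toward the measurement
# by a gain that depends on how uncertain the estimate currently is.

def kalman_step(est: float, est_var: float,
                measurement: float, meas_var: float) -> tuple[float, float]:
    gain = est_var / (est_var + meas_var)        # large est_var -> trust sensor
    new_est = est + gain * (measurement - est)   # pull estimate toward reading
    new_var = (1 - gain) * est_var               # uncertainty shrinks each step
    return new_est, new_var

# Fusing noisy position readings around a true position of 10.0:
est, var = 0.0, 100.0                            # vague initial estimate
for z in [9.8, 10.3, 9.9, 10.1]:
    est, var = kalman_step(est, var, z, meas_var=4.0)
```

After a few updates the estimate converges near the true position and the variance drops well below the sensor noise, which is exactly the behaviour you then validate against ground-truth data rather than assume.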
All my degrees are with distinction, and I mention this not to boast but to illustrate something about scientific training: it teaches you that ‘good enough’ has a specific, measurable meaning. In astronomy, I learned to quantify uncertainty. In psychology, I learned to design experiments that control for confounds. In philosophy, I learned to question assumptions. These skills make me a better AI architect than any specific programming language could.
What to look for in an AI development team
If you’re evaluating AI development partners, here are the signals of scientific rigour. A rigorous team defines evaluation metrics upfront, before building, not after. They can show you evaluation results with specific numbers, not just qualitative descriptions. They document what didn’t work alongside what did. They state known limitations alongside capabilities. They can explain their methodology in detail — not just what they built, but how they validated it. And they treat model performance as something to measure, not something to assume.
These habits come from scientific training, from working in environments where claims must be backed by evidence. They’re not universal in the AI industry, and their absence is often the root cause of AI projects that look impressive in demos but fail in production.
Want AI development with real evaluation methodology? Contact us — we bring scientific discipline to every project.