Once MIKAI was stable, I ran it side-by-side with GPT-4, Claude 3 Opus, Gemini 1.5 Pro, and a fine-tuned LLaMA 70B. I asked each model questions from three buckets:
- Guideline-based Q&A (e.g., ADA 2025 diabetes standards, AFI workup).
- Clinical reasoning (symptoms → differentials → management).
- Journal summarization (new NEJM trials, meta-analyses).
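The comparison itself needs very little machinery: the same question set, looped over every backend, with answers dumped to disk for later scoring. Below is a minimal sketch of such a harness; the `ask()` wrapper, model identifiers, and question file names are placeholders rather than my actual pipeline.

```python
# Minimal side-by-side harness (sketch). ask() is a placeholder for whatever
# client wraps each backend (OpenAI, Anthropic, a local vLLM server, etc.).
import json

BUCKETS = {
    "guidelines": "questions_guidelines.json",  # guideline-based Q&A
    "reasoning": "questions_reasoning.json",    # symptoms -> differentials -> management
    "summaries": "questions_summaries.json",    # trial and meta-analysis abstracts
}
MODELS = ["mikai-24b", "gpt-4", "claude-3-opus", "gemini-1.5-pro", "llama-70b-ft"]

def ask(model: str, question: str) -> str:
    """Placeholder: route the question to the matching API or local server."""
    raise NotImplementedError

def run() -> None:
    results = []
    for bucket, path in BUCKETS.items():
        with open(path) as f:
            questions = json.load(f)  # a plain list of question strings
        for question in questions:
            for model in MODELS:
                results.append({
                    "bucket": bucket,
                    "model": model,
                    "question": question,
                    "answer": ask(model, question),
                })
    with open("results.json", "w") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    run()
```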
Here’s what I found.
Knowledge Depth & Specialization
- MIKAI 24B
  - Strong recall of guidelines when paired with RAG (see the retrieval sketch after this list).
  - Sticks to structured medical language.
  - Rarely hallucinates when the relevant context is provided.
- GPT-4 / Claude
  - Very strong at summarization and general medical knowledge.
  - Sometimes paraphrases or introduces extra details that are not in the guidelines.
- LLaMA 70B (fine-tuned)
  - Competitive with MIKAI, but without RAG it misses clinical nuance.
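That RAG pairing does most of the heavy lifting: the model answers from retrieved guideline text instead of from its weights. Here is a minimal sketch of the retrieval step, assuming a sentence-transformers embedder and a FAISS index; the chunk texts and model name are illustrative, not MIKAI's actual stack.

```python
# Sketch of the retrieval step: embed guideline chunks once, then prepend the
# top matches to every question. Chunk texts and model names are illustrative.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

guideline_chunks = [
    "ADA Standards of Care: HbA1c target below 7% for most nonpregnant adults ...",
    "ADA Standards of Care: begin screening asymptomatic adults at age 35 ...",
    # ... one chunk per guideline paragraph, loaded from the curated corpus
]

embeddings = embedder.encode(guideline_chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # cosine similarity via normalized inner product
index.add(embeddings)

def build_prompt(question: str, k: int = 3) -> str:
    """Return a grounded prompt: top-k guideline chunks followed by the question."""
    q_emb = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(q_emb, k)
    context = "\n\n".join(guideline_chunks[i] for i in ids[0] if i >= 0)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```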
Clinical Reasoning
- MIKAI 24B
  - Very good at structured reasoning: protocol-driven answers (see the prompt sketch after this list).
  - Best when the problem is diagnostic or management-oriented.
- GPT-4
  - Still the king of “Socratic reasoning.”
  - Can explain why one diagnosis is more likely than another.
- Claude / Gemini
  - Excellent at synthesizing literature evidence to support decisions.
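“Protocol-driven” is largely a prompting decision: force every case into the symptoms → differentials → workup → management shape before any model sees it. The template below is a sketch; the wording and JSON keys are illustrative, not the exact prompt from my benchmark.

```python
# Sketch: a structured prompt for the reasoning bucket. The JSON keys just
# force the symptoms -> differentials -> workup -> management ordering.
REASONING_TEMPLATE = """You are assisting a physician. Work through the case step by step.

Case: {case}

Return JSON with exactly these keys:
  "key_findings":  salient symptoms, signs, and labs
  "differentials": list of {{"dx": ..., "why_likely": ..., "why_unlikely": ...}}
  "workup":        ordered list of next investigations
  "management":    initial management for the leading diagnosis
"""

def reasoning_prompt(case: str) -> str:
    return REASONING_TEMPLATE.format(case=case)

# Example:
# print(reasoning_prompt("62M, crushing chest pain radiating to the jaw, diaphoresis"))
```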
Safety & Reliability
- MIKAI
  - Needs explicit guardrails for drug dosing (see the sketch after this list).
  - When uncertain, it defaults to “insufficient context” rather than hallucinating.
- GPT-4 / Claude
  - Safer by design, thanks to their alignment layers.
  - But often too cautious, producing “consult your doctor” disclaimers (which are redundant when the user is a physician).
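A dosing guardrail does not have to be elaborate. Even a post-generation check that compares every dose the model mentions against a reference table catches the worst failures. The sketch below uses a hypothetical two-item table standing in for a vetted formulary.

```python
# Sketch: a post-generation guardrail for drug dosing. The reference table and
# regex are illustrative placeholders, not a clinical source of truth.
import re

# Hypothetical per-dose adult limits in mg; a real system would pull these from
# a vetted formulary, keyed by route, indication, and renal function.
MAX_SINGLE_DOSE_MG = {
    "acetaminophen": 1000,
    "metformin": 1000,
}

DOSE_RE = re.compile(r"(?P<drug>[A-Za-z]+)\s+(?P<dose>\d+(?:\.\d+)?)\s*mg", re.I)

def flag_dosing(answer: str) -> list[str]:
    """Return warnings for any drug dose exceeding the reference limit."""
    warnings = []
    for m in DOSE_RE.finditer(answer):
        drug = m.group("drug").lower()
        dose = float(m.group("dose"))
        limit = MAX_SINGLE_DOSE_MG.get(drug)
        if limit is not None and dose > limit:
            warnings.append(f"{drug} {dose} mg exceeds reference single dose of {limit} mg")
    return warnings

# Example: flag_dosing("Give acetaminophen 1500 mg PO q6h") ->
# ["acetaminophen 1500.0 mg exceeds reference single dose of 1000 mg"]
```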
