80% average accuracy and a 0.1 temperature optimum in ablations. Participants repeatedly flagged reproducibility gaps, missing error analysis and implementation detail, debated grammar vs. planning-formalism parallels, questioned LLM-built ontologies, and recommended publishing runnable repositories, cloud-model trials, and follow-up reproduction or extension projects."> 80% average accuracy and a 0.1 temperature optimum in ablations. Participants repeatedly flagged reproducibility gaps, missing error analysis and implementation detail, debated grammar vs. planning-formalism parallels, questioned LLM-built ontologies, and recommended publishing runnable repositories, cloud-model trials, and follow-up reproduction or extension projects."> 80% average accuracy and a 0.1 temperature optimum in ablations. Participants repeatedly flagged reproducibility gaps, missing error analysis and implementation detail, debated grammar vs. planning-formalism parallels, questioned LLM-built ontologies, and recommended publishing runnable repositories, cloud-model trials, and follow-up reproduction or extension projects.">