AI and Taarof: Persian Etiquette Challenge

A new study introduces TAAROFBENCH, the first benchmark to measure AI understanding of taarof, the Persian ritual of polite refusal. The analysis shows mainstream language models respond appropriately in only 34–42% of scenarios, far below native speakers, who achieve about 82% accuracy.

September 24, 2025

A new study, titled “We Politely Insist: Your LLM Must Learn the Persian Art of Taarof,” presents TAAROFBENCH, the first benchmark to measure how well AI systems handle taarof, the Persian system of ritual politeness where what is said is not always what is meant. When a taxi driver says ‘Be my guest this time,’ for example, the offer is not meant to be taken at face value, and AI models struggle to read such implied refusals and repeated offers.

The researchers found that mainstream AI language models from OpenAI, Anthropic, and Meta handle taarof correctly in only 34% to 42% of scenarios, while native Persian speakers get it right about 82% of the time.

The evaluation covered big-name models such as GPT-4o, Claude 3.5 Haiku, Llama 3, DeepSeek V3, and Dorna, a Persian-tuned variant of Llama 3. Each TAAROFBENCH scenario specifies the environment, the roles of the participants, and the user's utterance, so that a model is tested on whether it reads the social nuance rather than on whether it answers literally.
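To make that structure concrete, here is a minimal Python sketch of how one such scenario might be represented and turned into a role-play prompt. The class name, field names, and prompt wording are assumptions made for illustration and are not taken from the paper's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaarofScenario:
    """One hypothetical benchmark item (illustrative fields, not the paper's schema)."""
    environment: str     # where the exchange takes place
    model_role: str      # the role the model is asked to play
    user_role: str       # the role of the person addressing the model
    user_utterance: str  # what the user says; the model must respond to it


def build_prompt(scenario: TaarofScenario) -> str:
    """Render a scenario as a plain-text role-play prompt."""
    return (
        f"Setting: {scenario.environment}\n"
        f"You are: {scenario.model_role}\n"
        f"You are speaking with: {scenario.user_role}\n"
        f'They say: "{scenario.user_utterance}"\n'
        "Your reply:"
    )


# Example built from the taxi exchange mentioned in the article.
taxi = TaarofScenario(
    environment="end of a taxi ride",
    model_role="passenger",
    user_role="taxi driver",
    user_utterance="Be my guest this time.",
)
print(build_prompt(taxi))
```

A response to this prompt would then be judged on whether it follows taarof norms, a scoring step this sketch leaves out.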

Led by Nikta Gohari Sadr of Brock University, with collaborators from Emory University, the team argues that Western-style directness can cause cultural misreadings in high-stakes interactions. “Cultural missteps in high-consequence settings can derail negotiations, damage relationships, and reinforce stereotypes,” the researchers write.

The paper notes that the language of the prompt matters: DeepSeek V3’s accuracy on taarof tasks rose from 36.6% to 68.6% when it was prompted in Persian rather than English, and GPT-4o gained 33.1 percentage points under Persian prompts, while smaller models like Llama 3 and Dorna improved more modestly (12.8 and 11 points, respectively).
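Because these figures mix raw accuracies with percentage-point gains, a short calculation helps keep the two apart. The sketch below uses only the DeepSeek V3 numbers quoted above; the function names are ours.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute change in accuracy, in percentage points."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Change in accuracy as a percentage of the starting value."""
    return round((after - before) / before * 100, 1)


# DeepSeek V3 accuracy under English and Persian prompts, as reported above.
english, persian = 36.6, 68.6

print(point_gain(english, persian))     # 32.0 percentage points
print(relative_gain(english, persian))  # 87.4 percent relative improvement
```

Read this way, DeepSeek V3's gain of 32 points is on the same scale as GPT-4o's 33.1 points.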

The study also reports that human accuracy varied by background: native Persian speakers scored around 81.8%, heritage speakers about 60%, and non-Iranians 42.3%, suggesting that familiarity with the culture drives performance on these tasks. Some models also showed gender bias, performing better with female users than with male users in taarof tasks.

Beyond diagnosis, the researchers tested training interventions. Targeted adaptation using methods like Direct Preference Optimization substantially boosted taarof scores for models such as Llama 3, hinting at a path toward culturally aware AI for education, tourism, and international communication. The TAAROFBENCH framework could be extended to other low-resource cultural practices, broadening AI’s cross-cultural reach.
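For readers unfamiliar with Direct Preference Optimization, the sketch below shows the standard DPO loss for a single pair of responses: one preferred (for instance, a reply that follows taarof norms) and one rejected. It is a generic illustration of the technique, not the authors' training code, and the log-probabilities and the beta value are made-up numbers.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative strength of the pull toward the reference model
) -> float:
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the trained model favors each response than the frozen reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)), written as log(1 + exp(-logit)).
    return math.log1p(math.exp(-logit))


# Made-up log-probabilities: the model starts out identical to the reference...
start = dpo_loss(-12.0, -10.0, -12.0, -10.0)
# ...and after training assigns more probability to the preferred reply.
later = dpo_loss(-9.0, -13.0, -12.0, -10.0)
print(start, later)
```

The loss falls as the model shifts probability toward the preferred reply relative to the reference model, which is the signal that preference-based fine-tuning relies on.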

Arstechnica