TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

Bangde Du1*, Minghao Guo2*, Songming He3, Ziyi Ye3†, Xi Zhu2, Weihang Su1, Shuqi Zhu1, Yujia Zhou1, Yongfeng Zhang2, Qingyao Ai1†, Yiqun Liu1
1Tsinghua University 2Rutgers University 3Fudan University
*Equal Contribution †Corresponding authors

Abstract

Large Language Models (LLMs) exhibit emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack a systematic framework, and do not analyze which capabilities the task requires. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy, they still fall short of sustaining consistent persona simulation, particularly in syntactic style and memory recall.

Overview

Conceptual Framework of TwinVoice
Figure 1: The framework of TwinVoice. (Left) The evaluation is structured across three core dimensions that represent distinct aspects of identity expression: Social Persona (public-facing), Interpersonal Persona (private interaction), and Narrative Persona (fictional scenarios). The LLMs are prompted with a person’s historical context to simulate their behavior. The LLM’s persona simulation ability is broken down into six fundamental capabilities. (Right) Experimental results averaged over the three dimensions.

Task Definition

Figure 2: Overview of the TwinVoice evaluation. (Top) The LLMs are prompted with a specific persona’s history and given a stimulus to respond to. (Bottom) Three evaluation protocols. Discriminative: the model chooses among options A–D, one of which is the ground-truth persona behavior. Generative-Ranking: the model writes a response and an LLM-as-a-Judge selects the best candidate, yielding Acc.(Gen). Generative-Scoring: the model writes a response and the Judge rates its similarity to the ground truth on opinion, logic, and style, yielding Score(Gen).
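
As a rough illustration of how two of these protocols might be scored, the Python sketch below computes a discriminative accuracy and an averaged Judge score. It is not the benchmark's actual implementation: the `Item` record, the `choose` callback, the rating keys, and the equal weighting of opinion, logic, and style are all assumptions made for this example, and the rating scale is left open.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass
class Item:
    """One hypothetical benchmark item (field names are illustrative)."""
    history: str            # the persona's historical context
    stimulus: str           # the situation the persona responds to
    options: Sequence[str]  # four candidate behaviors, labelled A-D in order
    answer: str             # label of the ground-truth behavior, e.g. "B"


def discriminative_accuracy(items: Sequence[Item],
                            choose: Callable[[Item], str]) -> float:
    """Fraction of items where the model's chosen label is the ground truth."""
    if not items:
        raise ValueError("need at least one item")
    correct = sum(1 for item in items if choose(item) == item.answer)
    return correct / len(items)


def mean_judge_score(ratings: Sequence[Mapping[str, float]]) -> float:
    """Average Judge rating over responses.

    Each rating holds 'opinion', 'logic', and 'style' values; weighting the
    three aspects equally is an assumption of this sketch.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    aspects = ("opinion", "logic", "style")
    per_response = [sum(r[a] for a in aspects) / len(aspects) for r in ratings]
    return sum(per_response) / len(per_response)
```

In practice, `choose` would wrap a call to the model under test with the persona history, the stimulus, and the four options, and the ratings would come from the Judge model.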