TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

Bangde Du¹*, Minghao Guo²*, Songming He³, Ziyi Ye³†, Xi Zhu², Weihang Su¹, Shuqi Zhu¹, Yujia Zhou¹, Yongfeng Zhang², Qingyao Ai¹†, Yiqun Liu¹

¹Tsinghua University ²Rutgers University ³Fudan University

*Equal contribution †Corresponding authors

Abstract

Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack a systematic framework, and do not analyze which capabilities the task requires. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy, they still fall short of sustaining a consistent persona simulation, particularly in syntactic style and memory recall.

Overview

Figure 1: The framework of TwinVoice. (Left) The evaluation is structured across three core dimensions that represent distinct aspects of identity expression: Social Persona (public-facing), Interpersonal Persona (private interaction), and Narrative Persona (fictional scenarios). The LLMs are prompted with a person’s historical context to simulate their behavior. The LLM’s persona simulation ability is categorized into six fundamental capabilities. (Right) Experimental results averaged over the three dimensions.

Task Definition

Figure 2: Overview of the TwinVoice experimental evaluation. (Top) The LLMs are prompted with a specific persona’s history and given a stimulus to respond to. (Bottom) Three evaluation protocols. Discriminative: the model chooses among A–D, one of which is the ground-truth persona behavior. Generative-Ranking: the model writes a response and an LLM-as-a-Judge selects the best candidate, yielding Acc.(Gen). Generative-Scoring: the model writes a response and the Judge rates similarity on opinion, logic, and style, yielding Score(Gen).
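To make the three protocols in Figure 2 concrete, the following is a minimal sketch of how the corresponding metrics could be computed. It is not the benchmark's actual implementation. The item schema (`history`, `stimulus`, `options`, `answer`), all function names, and the token-overlap stand-ins for the model and the Judge are hypothetical. Two further assumptions are made because the caption does not specify them: in Generative-Ranking the Judge picks, among the same A–D options, the one closest to the generated text, and in Generative-Scoring the generated text is compared against the ground-truth option.

```python
from typing import Callable, Dict, List

# Hypothetical item schema (an assumption, not the benchmark's format):
# history: str, stimulus: str, options: list of 4 str, answer: int index.
Item = Dict[str, object]


def discriminative_accuracy(items: List[Item], choose: Callable) -> float:
    """Fraction of items where the model picks the ground-truth option."""
    hits = sum(
        choose(it["history"], it["stimulus"], it["options"]) == it["answer"]
        for it in items
    )
    return hits / len(items)


def generative_ranking_accuracy(items: List[Item], generate: Callable,
                                judge_pick: Callable) -> float:
    """Acc.(Gen) under the assumption that the judge maps the generated
    text to one of the options and is correct if it is the ground truth."""
    hits = 0
    for it in items:
        text = generate(it["history"], it["stimulus"])
        hits += judge_pick(text, it["options"]) == it["answer"]
    return hits / len(items)


def generative_score(items: List[Item], generate: Callable,
                     judge_rate: Callable) -> Dict[str, float]:
    """Score(Gen): mean judge rating per axis (opinion, logic, style),
    assuming the reference is the ground-truth option."""
    totals = {"opinion": 0.0, "logic": 0.0, "style": 0.0}
    for it in items:
        text = generate(it["history"], it["stimulus"])
        reference = it["options"][it["answer"]]
        ratings = judge_rate(text, reference)
        for axis in totals:
            totals[axis] += ratings[axis]
    return {axis: total / len(items) for axis, total in totals.items()}


# --- Toy stand-ins so the sketch runs without any LLM (assumptions) ---
def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used here in place of a real model or judge."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def toy_choose(history, stimulus, options):
    return max(range(len(options)), key=lambda i: jaccard(history, options[i]))


def toy_generate(history, stimulus):
    return history  # trivially echoes the persona history


def toy_judge_pick(text, options):
    return max(range(len(options)), key=lambda i: jaccard(text, options[i]))


def toy_judge_rate(text, reference):
    s = jaccard(text, reference)
    return {"opinion": s, "logic": s, "style": s}


ITEMS = [{
    "history": "honestly the new tram line is great",
    "stimulus": "Thoughts on the tram?",
    "options": ["the tram is a waste of money",
                "honestly the tram line is great",
                "no opinion here",
                "buses are better"],
    "answer": 1,
}]

if __name__ == "__main__":
    print(discriminative_accuracy(ITEMS, toy_choose))
    print(generative_ranking_accuracy(ITEMS, toy_generate, toy_judge_pick))
    print(generative_score(ITEMS, toy_generate, toy_judge_rate))
```

In a real setup, `toy_choose` and `toy_generate` would be replaced by calls to the model under evaluation, and the two judge functions by calls to the LLM-as-a-Judge. The metric functions themselves only aggregate the results.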