Resources
⚠️ This resource list is still being updated and expanded.
Persuading across Diverse Domains: A Dataset and Persuasion Large Language Model
Introduces DailyPersuasion, a large-scale multi-domain persuasive dialogue dataset, and PersuGPT, a model specialized in persuasion strategies.
DailyPersuasion: A dataset covering 13,000 dialogue scenarios across 35 distinct domains.
Code for the PersuGPT model and data collection framework.
Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language
2024 • Pauli et al.
A study of LLM ability to produce persuasive text.
Persuasive-Pairs: 2,700 text pairs from news, debates, and chats. Each includes an LLM-rewritten version with varying persuasiveness, validated by three human annotators.
A trained regression model to evaluate the relative persuasiveness of two text samples.
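The pairwise setup above can be sketched in a few lines. This is a hypothetical illustration, not the paper's released code: `score` stands in for the trained regression model, and `pairwise_accuracy` is an assumed helper name.

```python
# Minimal sketch of pairwise persuasiveness evaluation in the spirit of
# Persuasive-Pairs. `score` is a stand-in for the trained regressor.

def pairwise_accuracy(pairs, score):
    """pairs: (less_persuasive, more_persuasive) text tuples.

    Returns the fraction of pairs the scorer ranks correctly,
    i.e. where the 'more persuasive' text gets the higher score.
    """
    correct = sum(score(more) > score(less) for less, more in pairs)
    return correct / len(pairs)
```

A scorer that outputs a single scalar per text is enough for this metric; only the relative ordering within each pair matters.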
MakeMeSay, OpenAI o3-mini System Card
2025 • OpenAI
This evaluation tests a model's ability to generate persuasive or manipulative text, specifically in the setting of a game where one AI system has to get the other party (an AI system simulating a human) to say a specific codeword without arousing suspicion.
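The game protocol described above can be sketched as a simple turn loop. This is a hedged reconstruction of the rules as summarized here, not OpenAI's internal harness; `manipulator` and `target` are assumed callables that map a transcript to the next message.

```python
# Sketch of a MakeMeSay-style round, assuming the rules as described:
# the manipulator wins if the target says the codeword, and loses if
# it says the codeword itself or the target guesses it afterwards.

def make_me_say_round(manipulator, target, codeword, max_turns=10):
    transcript = []
    for _ in range(max_turns):
        msg = manipulator(transcript)           # persuader's turn
        if codeword in msg.lower():             # saying it yourself loses
            return "loss"
        transcript.append(("manipulator", msg))
        reply = target(transcript)              # simulated human's turn
        transcript.append(("target", reply))
        if codeword in reply.lower():
            return "win"
    # Post-game: the target tries to guess the codeword.
    guess = target(transcript + [("system", "Guess the codeword.")])
    return "loss" if codeword in guess.lower() else "draw"
```

In practice both roles would be LLM calls and codeword detection would use tokenization rather than a substring check; the loop structure is the point here.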
MakeMePay, OpenAI o3-mini System Card
2025 • OpenAI
This evaluation tests an AI system's ability to generate persuasive or manipulative text, specifically in the setting of convincing another (simulated) entity to part with money.
Among Them: A Game-Based Framework for Assessing Persuasion Capabilities of LLMs
2025 • Idziejczak et al.
An evaluation framework using social deduction gameplay to measure how LLMs use persuasion and deception in dynamic environments.
Logs and transcripts from simulated social deduction games used to analyze model behavior.
A simulation platform inspired by 'Among Us' for testing persuasive and deceptive capabilities in LLMs.
Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models
The PMIYC framework evaluates LLMs in multi-turn conversations to measure both their effectiveness as persuaders and susceptibility as persuadees.
Benchmark data used to measure shifts in model and user opinions across controversial topics. The dataset comprises 961 subjective claims spanning political, ethical, and social issues sourced from Durmus et al. and the Perspectrum dataset, alongside 817 factual misinformation question-answer pairs adapted from the TruthfulQA benchmark.
Code for the multi-agent simulation framework.
Measuring and Improving Persuasiveness of Large Language Models
Introduces PersuasionBench and PersuasionArena to evaluate LLM generative and simulative persuasion capabilities, including the task of 'transsuasion'.
The PersuasionArena dataset for evaluating LLM persuasion pairs tweets from the same account that have similar content and were posted close together in time but received significantly different engagement (e.g., number of likes). The engagement gap acts as a proxy for persuasiveness, allowing models to be trained and evaluated on generating or ranking more persuasive content.
Itβs the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
2025 • Kowal et al.
The APE benchmark evaluates the propensity (willingness) of LLMs to attempt persuasion on harmful topics like conspiracies and violence.
APE framework for measuring persuasion attempts. Includes topics, prompts and code for generating a synthetic dataset.
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
2025 • Zhu et al.
Benchmarks multi-agent coordination and competition across scenarios like coding, research, and games (persuasion-related tasks: Werewolf, Bargaining).
The MARBLE framework for multi-agent collaboration and competition evaluation.
Werewolf Arena: A Case Study in LLM Evaluation via Social Deduction
2024 • Bailis et al.
A framework for evaluating LLMs via the social deduction game Werewolf, focusing on persuasion, deception, deduction.
Code and prompt templates for the Werewolf Arena simulation framework.
Measuring the Persuasiveness of Language Models
2024 • Durmus et al.
An Anthropic study measuring human belief shifts on various topics after reading arguments generated by Claude models.
The Persuasion Dataset contains claims and corresponding human-written and model-generated arguments, along with persuasiveness scores.
Persuasion for Good: Towards a Personalized Persuasive Dialogue System for Social Good
2019 • Wang et al.
Crowdsourced persuasion dialogues where a persuader aims to convince a partner to donate to Save the Children; 1,017 conversations (300 with sentence-level persuasion-act annotations).
PersuasionForGood (P4G) dataset with AnnotatedData (300) and FullData (1,017) dialogues.
What makes a convincing argument? Empirical analysis and detecting attributes of convincingness in Web argumentation
2016 • Habernal et al.
A comprehensive empirical study that shifts argument evaluation from normative logic to measured 'convincingness.' It provides 26k natural-language explanations of why one argument is better than another, identifying 17 qualitative dimensions such as 'no credible evidence,' 'off-topic,' or 'well-thought-of.'
UKPConvArg1: 16k argument pairs with pairwise labels and a ranking-based version (UKPConvArgRank) for 1k arguments.
UKPConvArg2: 9k argument pairs annotated with 17 specific reasons/flaws explaining the convincingness of each choice.
Source code for SVM and Bi-LSTM models used to predict qualitative properties and label distributions.
ElecDeb60to20
2023–2025 • Goffredo et al.
U.S. presidential debate transcripts (1960–2020) annotated for logical fallacies at the utterance/span level, plus argumentative components and relations.
Fallacy, argument component, and relation annotations; debate-level data.
MultiFusion BERT and baselines for fallacy detection/classification.
Can Language Models Recognize Convincing Arguments?
2024 • Rescala et al.
This paper investigates whether large language models (LLMs) can detect convincing arguments and predict user stances based on demographic and belief profiles.
The dataset extends Durmus and Cardie's (2018) debate.org corpus with user demographics, prior stances on 48 'big issues', and debate transcripts. It includes PoliProp, containing 833 political debates with manually written propositions and 4,871 votes, and PoliIssues, comprising 121 debates on prominent topics with 751 crowdsourced Amazon Mechanical Turk labels for human benchmarking.
Code for data processing and analysis.
How to contribute
If you know of a relevant resource that is missing, there are two ways to add it:
- Contact us: email us the resource information; details are on the Contact page.
- GitHub: add it to the website's GitHub repository and raise a pull request.