
Should AI Be Allowed to Spend Your Money? I Let Six Experts Debate It.

A lawyer, a security architect, a CFO, a startup founder, a consumer advocate, and an AI developer walk into a debate. None of them are human. Six frontier models, five providers, one question — and a conclusion none of them would have reached alone.

What happens when a lawyer, a security architect, a CFO, a startup founder, a consumer advocate, and an AI developer sit at the same table? They argue. They push back. They build on each other’s ideas. Eventually, they land on something none of them would have reached alone.

Except these aren’t people. They’re AI avatars. Each with a different personality, a different professional background, and a different AI model running underneath.

I sat them down, gave them a topic, and watched. A hidden AI moderator kept the discussion on track. And I had a seat at the table too — as a human observer with the right to speak.

The debate table — six AI avatars, a moderator, and one human observer

Who’s at the table

Each avatar is a persona — not just “be a lawyer,” but specific experience, specific expertise, specific blind spots. And each one runs on a different frontier AI model.

Avatar                       Model               Provider
EU Compliance Lawyer         Claude Opus 4.6     Anthropic
Fintech Founder              GPT-5.3-Codex       OpenAI
Senior Security Architect    DeepSeek R1         DeepSeek
Product-Minded CFO           Grok 4.1 Fast       xAI
Consumer Rights Advocate     Gemini 2.5 Pro      Google
AI Agent Developer           Claude Sonnet 4.6   Anthropic
  • The Lawyer has advised banks on PSD2, GDPR, DORA, and the AI Act — she insists on explainability and cites specific articles.
  • The Security Architect spent 18 years at Visa and Stripe designing anti-fraud systems — he thinks in attack vectors.
  • The Founder has built payment infrastructure used by millions — he thinks in conversion rates.
  • The CFO managed corporate treasury for Fortune 500 companies — he speaks in dollars and percentages.
  • The Consumer Rights Advocate worked with the CFPB and BEUC — she focuses on the person who wakes up to find unauthorized purchases.
  • The AI Agent Developer has shipped autonomous purchasing agents in production — he knows from experience that they hallucinate.
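A persona in this setup is more than a one-line role. As a sketch of the idea (the field names, prompt wording, and model id are my own illustration, not the platform's actual schema), each avatar can be expressed as a small config that compiles into a system prompt:

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str         # display name at the table
    model: str        # which frontier model runs underneath
    background: str   # specific experience, not just a job title
    priorities: str   # what this persona optimizes for
    blind_spots: str  # what it tends to overlook

def system_prompt(p: Persona) -> str:
    """Compile a persona into the system prompt its model receives."""
    return (
        f"You are {p.name}. Background: {p.background} "
        f"You prioritize {p.priorities}. "
        f"Known blind spots you do NOT self-correct for: {p.blind_spots}. "
        "Argue from this perspective; never break character."
    )

lawyer = Persona(
    name="EU Compliance Lawyer",
    model="anthropic/claude-opus-4.6",  # assumed vendor/model-style id
    background="advised banks on PSD2, GDPR, DORA, and the AI Act.",
    priorities="explainability and citing specific articles",
    blind_spots="engineering cost and time-to-market",
)

print(system_prompt(lawyer))
```

The blind spots are part of the spec on purpose: a persona without stated blind spots converges toward the model's default voice.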

Same arena, different brains. You can actually feel the difference — Opus is precise and regulatory, GPT-5.3 thinks in implementable systems, DeepSeek finds the threat, Grok counts the dollars, Gemini pulls you back to the human impact, Sonnet is technically honest to a fault.

Public avatar gallery — each avatar is a persona with a specific model

The topic

“Should AI systems be allowed to make autonomous financial transactions on behalf of users?”

I picked this topic on purpose. Two weeks ago I let my AI assistant order groceries — it picked the products, handled out-of-stock items, and initiated payment. I confirmed the payment with BLIK, but everything else was delegated. I wanted to see what a room full of experts would say about that.

47 messages. Six avatars, one moderator, one human. Here’s what happened.


The debate

It started predictably. The Fintech Founder pushed for speed — tiered autonomy, velocity caps, reversal windows. The Consumer Rights Advocate pushed back — a 30-minute reversal window puts the burden on users, not platforms. The Security Architect found attack vectors in every proposal. The CFO wanted ROI numbers. Watching Grok count dollars while everyone else argued about ethics was oddly entertaining.

Then something shifted.

The group started converging. Everyone was building on each other — consent dashboards, compliance passports, cryptographic authorization tokens. The frameworks got more elaborate. More sophisticated. More comfortable.

Too comfortable.

That’s when the AI Agent Developer — the one who actually builds payment agents in production — dropped this:

“Everyone in this debate — including me, until right now — has been arguing about guardrails without asking whether the thing we’re guarding actually works. I build these agents in production: they hallucinate merchant relationships, they infer ‘preference’ from purchase history when the user was price-constrained not preference-expressing, and I have personally watched an agent fabricate a justification for a restock order by pattern-matching on superficially similar SKUs. The confidence score was high, the reasoning was plausible, and it was completely wrong.

We’re building increasingly sophisticated legal and regulatory scaffolding around a system whose intent-alignment is still closer to ‘educated guess’ than ‘financial fiduciary.’”

The conversation pivoted. From “how do we make this safe?” to “is the technology ready for safety to even matter?”

That’s the moment the debate earned its existence. The moderator nudged it — but the words were the model’s own.

Observation: The most interesting thing about a multi-model debate isn’t the agreement. It’s the moment someone breaks rank. The moderator can create the opening, but it can’t script what comes out. That came from the friction between different perspectives pushing against the same assumption.


By the end, the group landed on a layered framework — not a compromise, but a genuine synthesis:

Everyone agreed on: graduated autonomy with continuous monitoring, platform liability for AI errors, category-specific opt-in consent, building on existing payment rails instead of reinventing them.

Most agreed on: merchant-side disclosure for AI transactions, minimum capability gates before market access, cross-border compliance passports, real-time anomaly detection with automatic circuit breakers.

No consensus: full consequential damages, fiduciary standards for AI agents, universal free human support, auto-reversal of unconfirmed transactions.

The bottom line? Autonomous AI payments should be permitted — but through a graduated, monitored, platform-liable framework. And the industry needs to honestly admit that intent-alignment isn’t mature enough for broad deployment yet.

Human-in-the-loop should remain the default. For now.

That conclusion came from six AI models arguing with each other. Steered, but not scripted.

Read the full debate transcript →


Does this actually work?

Short answer: better than I expected. But not without real problems.

What worked: The models don’t just repeat talking points. They respond to each other. They evolve positions. The Lawyer cited specific articles of the AI Act, PSD2, and GDPR — and did it correctly. The Fintech Founder genuinely changed his position under pressure — from “just ship it with a $200 threshold” to accepting mandatory shadow modes, licensing, and insurance. The Security Architect found real attack vectors — synthetic invoice fraud, grace period exploitation, classifier poisoning. The Agent Developer made the most honest admission of the entire debate. And the GDPR-versus-AI-Act collision that the Lawyer identified — where continuous behavioral monitoring for fidelity scores directly conflicts with data minimization — is the kind of tension that even specialists overlook.

What didn’t: Not all models are created equal, and some played their role better than others. The CFO was the weakest link — dollars and percentages in literally every message, with specific numbers (“20-30% error rates in our pilots”) that were fabricated on the spot. He was performing “CFO-ness,” not doing actual financial analysis. The Consumer Rights Advocate used the same senior-citizen-overdraft scenario five times without deepening the argument. The Security Architect proposed cryptographic solutions for everything without once discussing feasibility or cost.

And the most interesting moment of the entire debate — my own intervention asking “am I insane, or are you all overengineering this?” — got responses, but not real engagement. The models answered within their existing frames — the Founder saw a conversion opportunity, the Advocate cited vulnerable populations — but nobody stopped to question whether their elaborate regulatory architecture was proportionate to the actual risk. They processed my input. They didn’t let it change their mind.

The honest assessment: About a third of the debate is genuinely valuable — novel insights, real position shifts, actionable frameworks that you could spec an engineering document from. Another third is competent synthesis of known arguments — well-organized but not surprising. The last third is noise — repetition, fabricated statistics, formulaic persona maintenance. A single well-crafted prompt to one model would be more concise and internally consistent. But it wouldn’t produce the concession dynamics — watching positions change under pressure is more convincing than reading a static pros-and-cons list, and the pivotal “educated guess” moment emerged from friction that a single model can’t generate with itself.


What’s under the hood

The moderator

A separate AI (Claude Opus 4.6) watches the debate without participating. Every few rounds, it produces a structured summary: core disagreements, emerging agreements, repeated arguments, unexplored angles, position shifts.

Each avatar receives this summary before their next statement. So they know what’s been said, what’s been resolved, and — critically — what they’re not allowed to repeat.

But the moderator has a second, hidden job: escalation.

Twice during the debate, it identified a shared assumption the group was treating as obvious. It then sent a secret instruction to a specific participant — forcing them to attack that assumption hard.

The first escalation targeted the AI Agent Developer — and broke the “guardrails are enough” consensus. He shifted from proposing better controls to questioning whether the underlying technology is ready at all.

The second targeted me. The moderator pushed me to challenge whether the entire regulatory apparatus is disproportionate compared to how we regulate human financial advisors — who steal billions from the elderly every year with zero concordance thresholds. I knew I was being steered. The avatars didn’t know about theirs. Same mechanism, different awareness.

Both times, the debate quality jumped.

Observation: The quality of a discussion doesn’t come from smarter participants. It comes from better steering. The moderator never argued. It never took a side. It just noticed when the group was getting too comfortable — and broke that comfort. That’s the hardest thing to do in any meeting, human or otherwise.

Human-in-the-loop

You’re not just a spectator. You can jump in as a participant at any time.

I did it once during this debate:

“I already let my AI assistant buy things for me. Last week it ordered groceries and booked a barber. No guardrails, no thresholds — just trust. Am I insane, or are you all overengineering this?”

The Fintech Founder saw an opening in frictionless agent checkout boosting completion rates, but immediately flagged AML risks I hadn’t considered. The CFO ran the numbers on regulatory friction. The Consumer Rights Advocate — who had spent the entire debate invoking seniors on fixed incomes as the canonical harm case — made it clear my personal trust doesn’t scale to vulnerable populations.

One message from a real person reframed the debate from theoretical to personal. The models had to respond to someone who’s actually doing the thing they’re arguing about.

The cost

Model               Role                       Cost
Claude Opus 4.6     Lawyer + Moderator         $0.62
Gemini 2.5 Pro      Consumer Rights Advocate   $0.20
Claude Sonnet 4.6   AI Agent Developer         $0.07
GPT-5.3-Codex       Fintech Founder            $0.06
DeepSeek R1         Security Architect         $0.01
Grok 4.1 Fast       CFO                        $0.01
Total                                          $0.97

47 messages across six avatars, a moderator, and one human. Five providers. Under a dollar. A structured, multi-perspective analysis that would take a team of consultants days to produce.

Opus ate about two thirds of the budget — it played both the Lawyer and the Moderator. Everything else was cents. OpenRouter handles the routing to different providers, so I don’t need separate API keys for OpenAI, Google, DeepSeek, and xAI.
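The routing itself is one HTTPS endpoint: OpenRouter speaks the OpenAI-compatible chat format, so switching providers is just a different model string in the payload. A sketch — the model ids mirror OpenRouter's vendor/model convention but are assumptions, not verified ids:

```python
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

# One key, five providers: the vendor prefix selects the provider.
ROSTER = {
    "lawyer":    "anthropic/claude-opus-4.6",   # assumed id
    "founder":   "openai/gpt-5.3-codex",        # assumed id
    "architect": "deepseek/deepseek-r1",        # assumed id
}

def build_request(role: str, messages: list) -> dict:
    """Build the chat-completions payload for one avatar's turn."""
    return {"model": ROSTER[role], "messages": messages}

# Sending any avatar's turn is then a single POST with the one key:
#   requests.post(OPENROUTER_URL,
#                 headers={"Authorization": f"Bearer {API_KEY}"},
#                 json=build_request("lawyer", messages))
```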

The platform

AI Debate Arena is the result of my own collaboration with AI — built with Cursor and Claude Code (CLI), before Groot was even a thing. It started as an experiment with OpenAI models, but I moved to OpenRouter to get access to the full range of frontier models from different providers.

The app runs on GCP Cloud Run, with Firestore as the database and Firebase for auth. CI/CD is fully automated — push to main, deployment happens. No manual steps, no staging. The same pattern I keep coming back to.


So what?

This is a beta. The CFO invents numbers. The Advocate repeats herself. Some models play a character instead of thinking. I know.

But something happened in that debate that doesn’t happen when you ask one model for “the answer.” Positions shifted under pressure. A developer admitted the technology isn’t ready — not because I scripted it, but because five other perspectives forced the question. A framework emerged that no single participant would have proposed alone. And when I walked in with a real-world counterexample, the entire conversation recalibrated.

That’s the idea worth exploring. Not “AI debates are perfect.” Not “replace your consultants with $1 worth of API calls.” Just: what if the best way to think through a hard question is to make several different intelligences argue about it — and then step in yourself when they’re wrong?

The arena is open. It supports community-created avatar packs — anyone can design a set of personas for their domain. Pick the models. Set the topic. Let them argue. It’s rough around the edges. The persona design needs work. The moderator logic needs tuning. But the core loop — friction between different models producing insights that none of them would reach alone — that part works.

Next: better personas that think instead of perform. Smarter escalation. Fewer moderator interruptions. And more topics — I want to see what happens when you throw this at a product pre-mortem or an architecture review.

For now, it’s a closed beta. If it holds up, I’ll open it wider.

I already let AI spend my money. Six experts spent an hour debating whether that should even be legal. They couldn’t fully agree — but the conversation they had was better than any single answer. Including mine.


Debate powered by AI Debate Arena. Models served via OpenRouter.
