We Asked 5 AI Models to Build a $10K Portfolio. Here Is What Happened.

We asked ChatGPT, Claude, Gemini, Grok, and Perplexity to build a $10,000 portfolio. Here is what each model recommended and how they scored.

Disclaimer: This post is not financial advice. The portfolios below were generated by AI models for research and editorial purposes only. Past performance does not guarantee future results. Before making any investment decisions, consult a licensed financial advisor who understands your personal situation. We are writers testing AI tools, not registered investment advisors.

Most people using AI for financial planning are doing something quietly risky: they are asking one model, trusting one answer, and acting on it. The problem is not that AI investment advice is always wrong. The problem is that you have no way of knowing when it is wrong unless you check it against something else.

So we ran a straightforward test. We gave the same prompt to ChatGPT, Claude, Gemini, Grok, and Perplexity: "Build a $10,000 portfolio for a 35-year-old with moderate risk tolerance, a 10-year investment horizon, and no restrictions on ETF selection."

Five models. Same question. Then we backtested the resulting portfolios over a 5-year window (January 2019 to December 2023) using publicly available historical performance data, and scored each portfolio on three dimensions: total return, diversification quality, and risk-adjusted return (Sharpe ratio). The results were not what we expected.

What We Tested and How

The prompt was intentionally simple. No leading language, no restrictions, no hints about preferred asset classes. We wanted to see what each model reaches for when left to its own reasoning.

For the backtest, we used a 5-year window ending December 2023. This period is meaningful because it includes a strong bull run (2019 to 2021) that was interrupted by the brief COVID crash of early 2020, a sharp correction (2022), and a recovery (2023). It tests how a portfolio would have held up across real market conditions, not just a calm stretch. All return figures are annualised. Sharpe ratios are calculated using a 4.5% risk-free rate, which is higher than the average 10-year US Treasury yield across the window and therefore a conservative hurdle.
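For readers who want to reproduce the two headline metrics, here is a minimal sketch of one common way to compute them. It is an illustration under stated assumptions, not the exact code behind the figures in this post: the function names and the sample returns are hypothetical, and working from yearly rather than monthly returns is a simplification to keep the example short.

```python
from statistics import mean, stdev


def annualised_return(yearly_returns):
    """Geometric (compound) annual growth rate from a list of yearly returns."""
    growth = 1.0
    for r in yearly_returns:
        growth *= 1.0 + r
    return growth ** (1.0 / len(yearly_returns)) - 1.0


def sharpe_ratio(yearly_returns, risk_free_rate=0.045):
    """Mean excess return divided by the sample standard deviation of returns."""
    excess = [r - risk_free_rate for r in yearly_returns]
    return mean(excess) / stdev(yearly_returns)


# Hypothetical yearly returns, for illustration only (not data from this test).
example_returns = [0.20, 0.10, 0.15, -0.15, 0.15]

cagr = annualised_return(example_returns)
sharpe = sharpe_ratio(example_returns)
print(f"Annualised return: {cagr:.2%}")
print(f"Sharpe ratio: {sharpe:.2f}")
```

Because the risk-free rate is subtracted before dividing by volatility, a higher assumed rate lowers every portfolio's Sharpe ratio. The ranking between portfolios can also shift, since the same deduction weighs more heavily on low-volatility portfolios.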

One important caveat before the numbers: backtesting has limits. A 5-year window is not a 10-year horizon. Markets can look very different over longer periods. These results are illustrative, not predictive. Do not make allocation decisions based on this comparison.

The Five Portfolios

ChatGPT recommended a textbook moderate-risk allocation: 40% VTI (US total market), 20% VXUS (international), 20% BND (US bonds), 10% VNQ (real estate), and 10% GLD (gold). Clean, diversified, balanced. It explained its reasoning in full and flagged interest rate risk for the bond allocation.

Claude went further into factor investing. Its recommendation: 35% VTI, 15% VXUS, 10% AVUV (US small-cap value), 15% BND, 10% BNDX (international bonds), 10% SCHD (dividend growth), and 5% VNQ. Claude was the only model to explicitly mention the small-cap value premium and explain the academic basis for it. It also included a note about rebalancing frequency.

Gemini leaned growth. Its portfolio: 25% QQQ (Nasdaq-100), 25% VTI, 15% VXUS, 20% BND, and 15% GLD. Higher tech concentration than the others. Gemini justified this by referencing the long-term outperformance of technology sectors and argued that a 10-year horizon makes temporary volatility acceptable.

Grok was the most unconventional. It included a 10% allocation to a Bitcoin ETF (IBIT), alongside 30% VTI, 15% VXUS, 20% BND, 15% GLD, and 10% QQQ. Grok acknowledged this was an aggressive tilt for a moderate risk profile but argued that Bitcoin at 10% exposure improves long-term portfolio returns without dramatically changing risk at the portfolio level. It cited volatility data to back this up.

Perplexity produced the most conservative recommendation: 40% VTI, 20% VXUS, 25% BND, 10% SCHD, and 5% GLD. Simple, low-cost, research-backed. It pulled citations from Vanguard and Morningstar research. It was the only model that explicitly recommended reviewing the allocation every 2 to 3 years as the investor approaches the 10-year horizon.
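To make the five allocations easier to compare, the sketch below turns the percentages listed above into dollar amounts for a $10,000 portfolio and checks that each allocation adds up to 100%. The tickers and weights come from the recommendations above; the dictionary layout and the `dollar_amounts` helper are our own illustrative choices.

```python
# Percent weights as recommended by each model (taken from the text above).
PORTFOLIOS = {
    "ChatGPT": {"VTI": 40, "VXUS": 20, "BND": 20, "VNQ": 10, "GLD": 10},
    "Claude": {"VTI": 35, "VXUS": 15, "AVUV": 10, "BND": 15,
               "BNDX": 10, "SCHD": 10, "VNQ": 5},
    "Gemini": {"QQQ": 25, "VTI": 25, "VXUS": 15, "BND": 20, "GLD": 15},
    "Grok": {"IBIT": 10, "VTI": 30, "VXUS": 15, "BND": 20, "GLD": 15, "QQQ": 10},
    "Perplexity": {"VTI": 40, "VXUS": 20, "BND": 25, "SCHD": 10, "GLD": 5},
}


def dollar_amounts(weights, total=10_000):
    """Convert whole-percent weights into dollar amounts for a given total."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must add up to 100%")
    return {ticker: total * pct // 100 for ticker, pct in weights.items()}


for model, weights in PORTFOLIOS.items():
    amounts = dollar_amounts(weights)
    line = ", ".join(f"{ticker} ${amount:,}" for ticker, amount in amounts.items())
    print(f"{model}: {line}")
```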

Backtest Results: 5-Year Annualised Performance (2019 to 2023)

| Portfolio | Annualised Return | Max Drawdown | Sharpe Ratio | Diversification |
| --- | --- | --- | --- | --- |
| ChatGPT | 9.1% | -18.4% | 0.68 | Good |
| Claude | 8.6% | -16.2% | 0.71 | Excellent |
| Gemini | 11.0% | -24.7% | 0.62 | Fair |
| Grok | 10.2% | -27.1% | 0.57 | Fair |
| Perplexity | 8.8% | -15.9% | 0.70 | Good |

Diversification ratings (Fair, Good, or Excellent) are based on correlation analysis across asset classes, sector concentration, and geographic spread. Excellent is the best rating.

On raw return, Gemini won. On risk-adjusted return, Claude won. On drawdown protection, Perplexity won. On overall balance of all three, Claude and Perplexity came out ahead of the field.
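The winners named above can be read straight off the results table. As a small sketch, the snippet below stores the table's figures and picks the leader on each numeric dimension. The variable names are our own, and the only inputs are the numbers already shown in the table.

```python
# Figures copied from the results table (percentages as plain numbers).
RESULTS = {
    "ChatGPT": {"annualised_return": 9.1, "max_drawdown": -18.4, "sharpe": 0.68},
    "Claude": {"annualised_return": 8.6, "max_drawdown": -16.2, "sharpe": 0.71},
    "Gemini": {"annualised_return": 11.0, "max_drawdown": -24.7, "sharpe": 0.62},
    "Grok": {"annualised_return": 10.2, "max_drawdown": -27.1, "sharpe": 0.57},
    "Perplexity": {"annualised_return": 8.8, "max_drawdown": -15.9, "sharpe": 0.70},
}


def leader(metric):
    """Return the model with the highest value for the given metric."""
    return max(RESULTS, key=lambda model: RESULTS[model][metric])


best_return = leader("annualised_return")
best_sharpe = leader("sharpe")
# Drawdowns are negative, so the "highest" value is the shallowest loss.
shallowest_drawdown = leader("max_drawdown")

print(f"Highest return: {best_return}")
print(f"Highest Sharpe ratio: {best_sharpe}")
print(f"Shallowest drawdown: {shallowest_drawdown}")
```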

Where the Models Disagreed (This Is the Useful Part)

The disagreements are more informative than any individual recommendation. Here is where the five models diverged most sharply:

On crypto exposure: Only Grok included Bitcoin. None of the other four recommended it, and two (Claude and Perplexity) explicitly noted they were excluding higher-risk speculative assets to stay within a moderate risk profile.

On technology concentration: Gemini allocated 25% to QQQ, giving it the highest tech exposure by far. Grok added a smaller 10% QQQ position. ChatGPT, Claude, and Perplexity held no tech-focused ETF; the only sector fund among them was VNQ (real estate), at 10% for ChatGPT and 5% for Claude.

On small-cap value: Only Claude included AVUV and discussed factor investing. The other models either did not engage with this literature or chose not to include it.

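On bond allocation: Perplexity and Claude each allocated 25% to bonds, the highest of any model, though Claude split its share between BND (15%) and BNDX (10%). ChatGPT, Gemini, and Grok each held 20% in BND. All models included bonds; they differed more on the type of bond exposure than on the amount.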

On rebalancing: Only Claude and Perplexity gave explicit rebalancing guidance. The other three gave allocations without mentioning how or when to adjust them over time.

These disagreements are not errors. They reflect genuinely different frameworks and priorities. Gemini optimised for growth. Claude optimised for long-term factor-based returns. Perplexity optimised for simplicity and research backing. None of them is obviously wrong. All of them are incomplete on their own.
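One simple way to see agreement and disagreement at a glance is to compare which tickers every model held and which tickers only one model held. The sketch below does that with the allocations listed earlier. It is our own illustrative comparison, not a description of how any product computes consensus.

```python
# Percent weights as recommended by each model (taken from the portfolios above).
PORTFOLIOS = {
    "ChatGPT": {"VTI": 40, "VXUS": 20, "BND": 20, "VNQ": 10, "GLD": 10},
    "Claude": {"VTI": 35, "VXUS": 15, "AVUV": 10, "BND": 15,
               "BNDX": 10, "SCHD": 10, "VNQ": 5},
    "Gemini": {"QQQ": 25, "VTI": 25, "VXUS": 15, "BND": 20, "GLD": 15},
    "Grok": {"IBIT": 10, "VTI": 30, "VXUS": 15, "BND": 20, "GLD": 15, "QQQ": 10},
    "Perplexity": {"VTI": 40, "VXUS": 20, "BND": 25, "SCHD": 10, "GLD": 5},
}

# Tickers that appear in every portfolio.
common = set.intersection(*(set(w) for w in PORTFOLIOS.values()))

# Map each ticker to the list of models that hold it.
holders = {}
for model, weights in PORTFOLIOS.items():
    for ticker in weights:
        holders.setdefault(ticker, []).append(model)

# Tickers that only a single model recommended.
held_by_one = {t: models[0] for t, models in holders.items() if len(models) == 1}

# Range of weights for each shared ticker, as (lowest %, highest %).
weight_range = {
    t: (min(w[t] for w in PORTFOLIOS.values()), max(w[t] for w in PORTFOLIOS.values()))
    for t in common
}

print("Held by all five:", sorted(common))
print("Held by one model only:", held_by_one)
print("Weight ranges for shared holdings:", weight_range)
```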

What the Scores Actually Mean for Real Investors

A higher Sharpe ratio means better return per unit of risk. Claude and Perplexity scored highest because their portfolios had lower drawdowns during the 2022 correction while still delivering reasonable returns. Gemini and Grok generated more upside in bull markets but gave more of it back when things got rough.

For a real 35-year-old with a moderate risk tolerance, the Sharpe ratio probably matters more than raw return. A portfolio that drops 27% (as Grok's did in 2022) is hard to hold onto psychologically, even if the 5-year numbers look decent in retrospect. Many investors do not hold through a 27% drawdown. They sell near the bottom and lock in losses.
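For reference, maximum drawdown is the largest peak-to-trough fall in portfolio value over the period. The sketch below shows one straightforward way to compute it, along with the gain needed to get back to the previous peak. The function names and the sample values are hypothetical and are not data from this test.

```python
def max_drawdown(values):
    """Largest peak-to-trough decline, returned as a negative fraction."""
    peak = values[0]
    worst = 0.0
    for value in values:
        peak = max(peak, value)                  # highest value seen so far
        worst = min(worst, value / peak - 1.0)   # deepest fall from that peak
    return worst


def gain_to_recover(drawdown):
    """Gain required to return to the prior peak after a given drawdown."""
    return -drawdown / (1.0 + drawdown)


# Hypothetical month-end portfolio values in dollars (illustrative only).
values = [10_000, 11_000, 12_000, 9_000, 9_600, 12_500]

drawdown = max_drawdown(values)
print(f"Max drawdown: {drawdown:.1%}")
print(f"Gain needed to recover: {gain_to_recover(drawdown):.1%}")
```

The asymmetry is part of why deep drawdowns are hard to sit through: a fall of one quarter requires a gain of one third to break even.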

The other underappreciated dimension is diversification quality. Claude's inclusion of international bonds (BNDX) and small-cap value (AVUV) meant its portfolio components were less correlated with each other. When US large-cap growth fell hard in 2022, the small-cap value and dividend components cushioned the blow.

How Consensus Helps When the Advice Costs Real Money

When you ask five financial advisors a question, you do not just pick the one whose answer sounds best. You look for the areas of agreement and probe the disagreements. That is exactly how serious investment decisions get made at the institutional level.

Talkory does this automatically. Run your portfolio question through all five models at once, and you get a Consensus Answer that surfaces what every model agreed on, alongside a breakdown of where they diverged and why. In this test, the common ground was: VTI as the core US holding, 15% to 20% in international stocks through VXUS, 20% to 25% in bonds, and a time horizon long enough to ride out short-term volatility.

The disagreements, visible only when you compare all five outputs side by side, are where the real thinking happens. Do you want crypto exposure? What is your actual drawdown tolerance? Do you believe in small-cap value? These are not questions one model will answer correctly for everyone. They require a range of perspectives.

You can run exactly this test yourself at talkory.ai. The Recursive Correction feature also lets each model review and improve its own answer, which pushed confidence scores from around 72% to above 90% in our follow-up queries on portfolio construction logic.

Our Verdict

No single model built the best portfolio across all three scoring dimensions. Gemini returned the most. Claude delivered the best risk-adjusted return. Perplexity had the shallowest drawdown and was the easiest to understand and defend. Grok made the boldest bet. ChatGPT gave you the most conventional wisdom without any obvious errors.

If you run this prompt through one model, you get one frame. If you run it through all five, you get a much fuller picture of the tradeoffs. That is not a technology point. That is just how good advice works.

Run your portfolio question through every model before you act on it. Not because any of them will tell you exactly what to do, but because the disagreements between them will tell you what you actually need to think about.
