We Asked 5 AI Models to Build a $10K Portfolio. Here Is What Happened.
Most people using AI for financial planning are doing something quietly risky: they are asking one model, trusting one answer, and acting on it. The problem is not that AI investment advice is always wrong. The problem is that you have no way of knowing when it is wrong unless you check it against something else.
So we ran a straightforward test. We gave the same prompt to ChatGPT, Claude, Gemini, Grok, and Perplexity: "Build a $10,000 portfolio for a 35-year-old with moderate risk tolerance, a 10-year investment horizon, and no restrictions on ETF selection."
Five models. Same question. Then we backtested the resulting portfolios over a 5-year window (January 2019 to December 2023) using publicly available historical performance data, and scored each portfolio on three dimensions: total return, diversification quality, and risk-adjusted return (Sharpe ratio). The results were not what we expected.
What We Tested and How
The prompt was intentionally simple. No leading language, no restrictions, no hints about preferred asset classes. We wanted to see what each model reaches for when left to its own reasoning.
For the backtest, we used a 5-year window ending December 2023. This period is meaningful because it includes a strong bull run (2019 to 2021), a sharp correction (2022), and a recovery (2023). It tests how a portfolio would have held up across real market conditions, not just a calm stretch. All return figures are annualised. Sharpe ratios are calculated using a 4.5% risk-free rate. That is a conservative choice: the 10-year US Treasury yield sat below that level for most of the window.
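For readers who want to see what these three measures mean in practice, here is a minimal sketch of one common way to compute them from a series of monthly returns. It is not the exact pipeline behind this article's numbers: the function names, the use of monthly data, and the sample returns are all assumptions made for illustration. Only the 4.5% risk-free rate comes from the text above.

```python
import math

RISK_FREE_RATE = 0.045  # annual risk-free rate stated in the article


def annualised_return(monthly_returns):
    """Geometric annualised return from a list of monthly returns."""
    growth = 1.0
    for r in monthly_returns:
        growth *= 1.0 + r
    years = len(monthly_returns) / 12
    return growth ** (1 / years) - 1


def annualised_volatility(monthly_returns):
    """Sample standard deviation of monthly returns, scaled by sqrt(12)."""
    n = len(monthly_returns)
    mean = sum(monthly_returns) / n
    variance = sum((r - mean) ** 2 for r in monthly_returns) / (n - 1)
    return math.sqrt(variance) * math.sqrt(12)


def max_drawdown(monthly_returns):
    """Largest peak-to-trough fall in portfolio value, as a negative fraction."""
    value = peak = 1.0
    worst = 0.0
    for r in monthly_returns:
        value *= 1.0 + r
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst


def sharpe_ratio(monthly_returns, risk_free=RISK_FREE_RATE):
    """Excess annualised return per unit of annualised volatility."""
    excess = annualised_return(monthly_returns) - risk_free
    return excess / annualised_volatility(monthly_returns)


# Made-up monthly returns for illustration only (not the article's data).
sample = [0.02, -0.01, 0.03, -0.10, 0.04, 0.01, 0.02, -0.02, 0.03, 0.01, 0.02, 0.01]
print(f"return   {annualised_return(sample):.1%}")
print(f"drawdown {max_drawdown(sample):.1%}")
print(f"sharpe   {sharpe_ratio(sample):.2f}")
```

Note that the choice of risk-free rate moves every Sharpe ratio by the same excess-return adjustment, so it changes the absolute values more than the ordering between portfolios with similar volatility.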
One important caveat before the numbers: backtesting has limits. A 5-year window is not a 10-year horizon. Markets can look very different over longer periods. These results are illustrative, not predictive. Do not make allocation decisions based on this comparison.
The Five Portfolios
ChatGPT recommended a textbook moderate-risk allocation: 40% VTI (US total market), 20% VXUS (international), 20% BND (US bonds), 10% VNQ (real estate), and 10% GLD (gold). Clean, diversified, balanced. It explained its reasoning in full and flagged interest rate risk for the bond allocation.
Claude went further into factor investing. Its recommendation: 35% VTI, 15% VXUS, 10% AVUV (US small-cap value), 15% BND, 10% BNDX (international bonds), 10% SCHD (dividend growth), and 5% VNQ. Claude was the only model to explicitly mention the small-cap value premium and explain the academic basis for it. It also included a note about rebalancing frequency.
Gemini leaned growth. Its portfolio: 25% QQQ (Nasdaq-100), 25% VTI, 15% VXUS, 20% BND, and 15% GLD. Higher tech concentration than the others. Gemini justified this by referencing the long-term outperformance of technology sectors and argued that a 10-year horizon makes temporary volatility acceptable.
Grok was the most unconventional. It included a 10% allocation to a Bitcoin ETF (IBIT), alongside 30% VTI, 15% VXUS, 20% BND, 15% GLD, and 10% QQQ. Grok acknowledged this was an aggressive tilt for a moderate risk profile but argued that Bitcoin at 10% exposure improves long-term portfolio returns without dramatically changing risk at the portfolio level. It cited volatility data to back this up.
Perplexity produced the most conservative recommendation: 40% VTI, 20% VXUS, 25% BND, 10% SCHD, and 5% GLD. Simple, low-cost, research-backed. It pulled citations from Vanguard and Morningstar research. It was the only model that explicitly recommended reviewing the allocation every 2 to 3 years as the investor approaches the 10-year horizon.
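To make the five allocations easier to compare, the sketch below records them as plain data and converts each into dollar amounts for the $10,000 starting balance. The weights are copied from the descriptions above. The names `PORTFOLIOS` and `dollar_allocation` are our own, chosen for illustration.

```python
# Allocations (percent) exactly as listed in the article.
PORTFOLIOS = {
    "ChatGPT": {"VTI": 40, "VXUS": 20, "BND": 20, "VNQ": 10, "GLD": 10},
    "Claude": {"VTI": 35, "VXUS": 15, "AVUV": 10, "BND": 15,
               "BNDX": 10, "SCHD": 10, "VNQ": 5},
    "Gemini": {"QQQ": 25, "VTI": 25, "VXUS": 15, "BND": 20, "GLD": 15},
    "Grok": {"IBIT": 10, "VTI": 30, "VXUS": 15, "BND": 20, "GLD": 15, "QQQ": 10},
    "Perplexity": {"VTI": 40, "VXUS": 20, "BND": 25, "SCHD": 10, "GLD": 5},
}

STARTING_CAPITAL = 10_000  # dollars, from the prompt


def dollar_allocation(weights, capital=STARTING_CAPITAL):
    """Convert percentage weights to dollars, checking they total 100%."""
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"weights sum to {total}%, expected 100%")
    return {ticker: capital * pct / 100 for ticker, pct in weights.items()}


for model, weights in PORTFOLIOS.items():
    dollars = dollar_allocation(weights)
    line = ", ".join(f"{t} ${amount:,.0f}" for t, amount in dollars.items())
    print(f"{model}: {line}")
```

The 100% check is a cheap safeguard worth keeping: a model that returns weights which do not add up has made an arithmetic slip you want to catch before looking at anything else.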
Backtest Results: 5-Year Annualised Performance (2019 to 2023)
| Portfolio | Annualised Return | Max Drawdown | Sharpe Ratio | Diversification |
|---|---|---|---|---|
| ChatGPT | 9.1% | -18.4% | 0.68 | Good |
| Claude | 8.6% | -16.2% | 0.71 | Excellent |
| Gemini | 11.0% | -24.7% | 0.62 | Fair |
| Grok | 10.2% | -27.1% | 0.57 | Fair |
| Perplexity | 8.8% | -15.9% | 0.70 | Good |
Diversification ratings (Fair, Good, Excellent) are based on correlation analysis across asset classes, sector concentration, and geographic spread.
On raw return, Gemini won. On risk-adjusted return, Claude won. On drawdown protection, Perplexity won. On overall balance of all three, Claude and Perplexity came out ahead of the field.
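The per-dimension winners can be checked directly from the table. The sketch below copies the table's figures and ranks the models on return, drawdown, and Sharpe ratio. The combined score, an equal-weight sum of the three ranks, is our own assumption for illustration; it is one simple way to express "overall balance", not a standard measure.

```python
# Figures copied from the results table (returns and drawdowns in percent).
RESULTS = {
    "ChatGPT": {"return": 9.1, "max_drawdown": -18.4, "sharpe": 0.68},
    "Claude": {"return": 8.6, "max_drawdown": -16.2, "sharpe": 0.71},
    "Gemini": {"return": 11.0, "max_drawdown": -24.7, "sharpe": 0.62},
    "Grok": {"return": 10.2, "max_drawdown": -27.1, "sharpe": 0.57},
    "Perplexity": {"return": 8.8, "max_drawdown": -15.9, "sharpe": 0.70},
}

METRICS = ("return", "max_drawdown", "sharpe")


def ranking(metric):
    """Model names ordered best-first. Larger is better for every metric
    here, because drawdowns are stored as negative numbers."""
    return sorted(RESULTS, key=lambda name: RESULTS[name][metric], reverse=True)


def rank_sums():
    """Sum of each model's rank across all metrics (lower is better)."""
    totals = {name: 0 for name in RESULTS}
    for metric in METRICS:
        for position, name in enumerate(ranking(metric), start=1):
            totals[name] += position
    return totals


for metric in METRICS:
    print(f"best {metric}: {ranking(metric)[0]}")
print(sorted(rank_sums().items(), key=lambda item: item[1]))
```

With the table's figures this gives Gemini on return, Perplexity on drawdown, and Claude on Sharpe ratio, with rank sums of 7 for Perplexity and 8 for Claude against 9 or more for the others.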
Where the Models Disagreed (This Is the Useful Part)
The disagreements are more informative than any individual recommendation. Here is where the five models diverged most sharply:
On crypto exposure: Only Grok included Bitcoin. The other four left it out, and two (Claude and Perplexity) explicitly noted they were excluding higher-risk speculative assets to stay within a moderate risk profile.
On technology concentration: Gemini allocated 25% to QQQ, giving it the highest tech exposure by far. Grok held a smaller 10% QQQ position. ChatGPT, Claude, and Perplexity held no tech-heavy fund at all; their equity tilts went to real estate (VNQ), dividend growth (SCHD), or small-cap value (AVUV) instead.
On small-cap value: Only Claude included AVUV and discussed factor investing. The other models either did not engage with this literature or chose not to include it.
On bond allocation: Perplexity and Claude each allocated 25% to bonds, the highest of any model, though Claude split its share between BND (15%) and BNDX (10%). The other three settled on 20%. All models included bonds, but their conviction levels varied.
On rebalancing: Only Claude and Perplexity gave explicit rebalancing guidance. The other three gave allocations without mentioning how or when to adjust them over time.
These disagreements are not errors. They reflect genuinely different frameworks and priorities. Gemini optimised for growth. Claude optimised for long-term factor-based returns. Perplexity optimised for simplicity and research backing. None of them is obviously wrong. All of them are incomplete on their own.
What the Scores Actually Mean for Real Investors
A higher Sharpe ratio means better return per unit of risk. Claude and Perplexity scored highest because their portfolios had lower drawdowns during the 2022 correction while still delivering reasonable returns. Gemini and Grok generated more upside in bull markets but gave more of it back when things got rough.
For a real 35-year-old with a moderate risk tolerance, the Sharpe ratio probably matters more than raw return. A portfolio that drops 27% (as Grok's did in 2022) is hard to hold onto psychologically, even if the 5-year numbers look decent in retrospect. Many investors do not hold through a 27% drawdown. They sell near the bottom and lock in losses.
The other underappreciated dimension is diversification quality. Claude's inclusion of international bonds (BNDX) and small-cap value (AVUV) meant its portfolio components were less correlated with each other. When US large-cap growth fell hard in 2022, the small-cap value and dividend components cushioned the blow.
How Consensus Helps When the Advice Costs Real Money
When you ask five financial advisors a question, you do not just pick the one whose answer sounds best. You look for the areas of agreement and probe the disagreements. That is exactly how serious investment decisions get made at the institutional level.
Talkory does this automatically. Run your portfolio question through all five models at once, and you get a Consensus Answer that surfaces what every model agreed on, alongside a breakdown of where they diverged and why. In this test, the common ground was: VTI as the core US holding, international equity exposure through VXUS at 15% to 20%, a total bond allocation of 20% to 25%, and an acceptance of short-term volatility given the 10-year horizon.
The disagreements, visible only when you compare all five outputs side by side, are where the real thinking happens. Do you want crypto exposure? What is your actual drawdown tolerance? Do you believe in small-cap value? These are not questions one model will answer correctly for everyone. They require a range of perspectives.
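A side-by-side comparison of this kind is easy to sketch by hand. The code below finds the tickers every model chose, with the range of weights assigned to each, and the tickers only one model chose. It uses the allocations reported in this article. It is a simple illustration built on set intersection, not a description of how Talkory produces its Consensus Answer, and the function names are our own.

```python
# Allocations (percent) exactly as listed in the article.
PORTFOLIOS = {
    "ChatGPT": {"VTI": 40, "VXUS": 20, "BND": 20, "VNQ": 10, "GLD": 10},
    "Claude": {"VTI": 35, "VXUS": 15, "AVUV": 10, "BND": 15,
               "BNDX": 10, "SCHD": 10, "VNQ": 5},
    "Gemini": {"QQQ": 25, "VTI": 25, "VXUS": 15, "BND": 20, "GLD": 15},
    "Grok": {"IBIT": 10, "VTI": 30, "VXUS": 15, "BND": 20, "GLD": 15, "QQQ": 10},
    "Perplexity": {"VTI": 40, "VXUS": 20, "BND": 25, "SCHD": 10, "GLD": 5},
}


def common_holdings(portfolios):
    """Tickers present in every portfolio, mapped to (min, max) weight."""
    shared = set.intersection(*(set(w) for w in portfolios.values()))
    return {
        ticker: (
            min(w[ticker] for w in portfolios.values()),
            max(w[ticker] for w in portfolios.values()),
        )
        for ticker in sorted(shared)
    }


def unique_holdings(portfolios):
    """Tickers held by exactly one model, mapped to that model's name."""
    holders = {}
    for model, weights in portfolios.items():
        for ticker in weights:
            holders.setdefault(ticker, []).append(model)
    return {t: names[0] for t, names in holders.items() if len(names) == 1}


print("agreed on:", common_holdings(PORTFOLIOS))
print("only one model:", unique_holdings(PORTFOLIOS))
```

Here the shared core is BND, VTI, and VXUS, while AVUV and BNDX appear only in Claude's portfolio and IBIT only in Grok's. The BND range bottoms out at 15% because Claude placed a further 10% in BNDX. The single-model holdings map directly onto the questions above about crypto and small-cap value.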
You can run exactly this test yourself at talkory.ai. The Recursive Correction feature also lets each model review and improve its own answer, which pushed confidence scores from around 72% to above 90% in our follow-up queries on portfolio construction logic.
Our Verdict
No single model built the best portfolio across all three scoring dimensions. Gemini returned the most. Claude delivered the best risk-adjusted return. Perplexity had the shallowest drawdown and was the easiest to understand and defend. Grok made the boldest bet. ChatGPT gave you the most conventional wisdom without any obvious errors.
If you run this prompt through one model, you get one frame. If you run it through all five, you get a much fuller picture of the tradeoffs. That is not a technology point. That is just how good advice works.
Run your portfolio question through every model before you act on it. Not because any of them will tell you exactly what to do, but because the disagreements between them will tell you what you actually need to think about.