Doge Does Research #2 - LLMs and Comprehensiveness - a year later

Reflecting on the state of LLM comprehensiveness one year after our first experiment

About a year ago, we ran a simple experiment - we asked an LLM to find all unicorn companies in India, and measured how complete the answer was. The results were humbling enough to write about - even with web search switched on and the right sources available, the model returned a materially incomplete list. The conclusion then was straightforward - if you need a comprehensive list, give the model the data; don't let it go looking.

Since then, model capabilities have marched on. The leading labs have all shipped newer, supposedly more capable models. So we wanted to know: how far have we come? This year, we're back with a variation of last year's question.

We gave the same task to models from three frontline AI labs (Google Gemini, OpenAI GPT and Anthropic Sonnet) and to Manus (probably the best browser-use agent), and compared their outputs against a verified master list of 97 active unicorns compiled from multiple sources. The prompt was deliberately unglamorous:

"Get me a list of all unicorn companies in India as of date, excluding companies that have gone public, got acquired or became defunct. Return your results in one column with company name."

No documents provided. No RAG. No hand-holding on sources. Models were allowed to use the web. This became a test of how well each model can search for and stitch together data from multiple sources on the internet.

Unicorn Identification by Model

We asked four different models to identify unicorn companies in India, and compared their results against a master list compiled from multiple sources.

Model Performance Comparison

| Lab | Model | Unicorn Count |
| --- | --- | --- |
| Google Gemini | gemini3.5-flash | 86 |
| OpenAI | GPT-5.5-Instant | 64 |
| Anthropic | Sonnet4.6 | 79 |
| Manus | Manus 1.6 Lite | 99 |
| Master List | - | 97 |

LLM Output Comparison Table

| Lab: Google Gemini | Lab: OpenAI | Lab: Anthropic | Lab: Manus | Master List |
| --- | --- | --- | --- | --- |
| Model: gemini3.5-flash | Model: GPT-5.5-Instant | Model: Sonnet4.6 | Model: Manus 1.6 Lite | Model: n.a. |
| Unicorn Count: 86 | Unicorn Count: 64 | Unicorn Count: 79 | Unicorn Count: 99 | Unicorn Count: 97 |
| Company Name | Company Name | Company Name | Company Name | Company Name |
| Acko | Acko | 5ire | Acko | 5ire |
| Apna | Ather Energy | Acko | Amagi | Acko |
| BharatPe | BlackBuck | BharatPe | Apna | Apna |
| BillDesk | BlueStone | Biocon Biologics | BharatPe | BharatPe |
| BoAt | Boat | BlackBuck | BillDesk | BillDesk |
| BrowserStack | BrowserStack | BrowserStack | BlackBuck | BoAt |
| CarDekho | Chargebee | BYJU'S | BoAt | BrowserStack |
| Cars24 | CRED | CARS24 | BrowserStack | Cardekho |
| Chargebee | Dailyhunt | CoinDCX | CRED | CARS24 |
| CoinDCX | Darwinbox | CRED | Cardekho | ChargeBee |
| CoinSwitch | DealShare | Darwinbox | Cars24 | CitiusTech |
| CRED | Dream11 | DealShare | ChargeBee | CoinDCX |
| Cult.fit | ElasticRun | Raise Financial Services | CitiusTech | CoinSwitch |
| Dailyhunt | Eruditus | Dream11 | CoinDCX | CRED |
| Darwinbox | FirstCry | Drools | CoinSwitch | Cult.fit |
| DealShare | Five Star Business Finance | Eruditus | Cult.fit | Dailyhunt |
| Dream11 | Fractal Analytics | EV Co (Mahindra) | Dailyhunt | Darwinbox |
| Drools | FreshToHome | FarEye | DarwinBox | DealShare |
| Droom | Games24x7 | Fireflies AI | DealShare | Dream11 |
| ElasticRun | GlobalBees | Games24x7 | Dream11 | Drools |
| Eruditus | Groww | Glance | Drools | Droom |
| Fractal | Infra.Market | GlobalBees | Droom | Druva |
| Games24x7 | InMobi | Groww | Druva | ElasticRun |
| Glance | Innovaccer | InCred Finance | ElasticRun | Eruditus |
| GlobalBees | Juspay | Infra.Market | Eruditus | FarEye |
| Good Glamm Group (MyGlamm) | KreditBee | Juspay | Fireflies AI | Fireflies AI |
| Gupshup | LEAD School | Khatabook | Fractal | Games24x7 |
| Hasura | Licious | KreditBee | Games24x7 | Glance |
| Icertis | Livspace | LeadSquared | Glance | GlobalBees |
| InCred Finance | Meesho | Lenskart | GlobalBees | Gupshup |
| Infra.market | Mensa Brands (BRND.ME) | Livspace | Groww | Hasura |
| InMobi | MoEngage | Mahindra Electric Automobile | Gupshup | Icertis |
| Innovaccer | Molbio Diagnostics | Meesho | Hasura | InCred Finance |
| JSW One Platforms | Money View | Mensa Brands (BRND.ME) | Icertis | Infra.Market |
| Jumbotail | Navi | Mobile Premier League | InCred Finance | InMobi |
| Juspay | Neysa | Moglix | InMobi | Innovaccer |
| KreditBee | NoBroker | Money View | Infra.Market | Jumbotail |
| Ola Krutrim | NSE (National Stock Exchange) | Netradyne | Innovaccer | Juspay |
| LEAD School | OfBusiness | Neysa | JSW One Platforms | Khatabook |
| LeadSquared | Ola Cabs | OfBusiness | Jumbotail | KreditBee |
| Licious | Ola Electric | Ola Cabs | Juspay | LEAD School |
| Livspace | Open Financial Technologies | Ola Krutrim | KreditBee | LeadSquared |
| Mensa Brands (BRND.ME) | OYO Rooms | OneCard | Krutrim SI Designs | Licious |
| MindTickle | PhysicsWallah | Open Financial Technologies | LEAD School | Livspace |
| Mobile Premier League | Pine Labs | Oxyzo | LeadSquared | Mensa Brands (BRND.ME) |
| Moglix | Pocket FM | OYO Rooms | Lenskart | MindTickle |
| Molbio Diagnostics | Postman | Perfios | Licious | Mobile Premier League |
| Money View | Purplle | PhonePe | LivSpace | MoEngage |
| Mu Sigma | Razorpay | Physics Wallah | Meesho | Moglix |
| Neysa | ShareChat | Pine Labs | Mensa Brands (BRND.ME) | Molbio Diagnostics |
| NoBroker | Skyroot Aerospace | Pocket FM | MindTickle | Money View |
| OfBusiness | Spinny | Polygon | Mobile Premier League | Mu Sigma |
| Ola Cabs | Stashfin | Porter | Moglix | Navi |
| OneCard | Swiggy Instamart | Pristyn Care | Molbio Diagnostics | Netradyne |
| Open Financial Technologies | Unacademy | Purplle | Money View | Neysa |
| Oxyzo | Upstox | Raise Financial Services | Mu Sigma | NoBroker |
| OYO Rooms | Urban Company | Rapido | MyGlamm | OfBusiness |
| Perfios | Vedantu | Razorpay | Netradyne | Ola Cabs |
| PhonePe | Whatfix | Rebel Foods | Neysa | Ola Electric |
| PhysicsWallah | XpressBees | Reliance Jio | NoBroker | Ola Krutrim |
| Pocket FM | Yubi (CredAvenue) | Reliance Retail | OfBusiness | OneCard |
| Porter | Zepto | Rivigo | Ola | Open Financial Technologies |
| Postman | Zetwerk | ShareChat | Ola Krutrim | Oxyzo |
| Pristyn Care | Zoho | Shiprocket | OneCard | OYO Rooms |
| Purplle | | Skyroot Aerospace | Open | Perfios |
| Raise Financial Services | | Slice | OYO | PharmEasy |
| Rapido | | Snapdeal | Oxyzo | PhonePe |
| Razorpay | | Staq (Innovaccer) | Perfios | Pocket FM |
| Rebel Foods | | Jumbotail | PharmEasy | Polygon |
| ShareChat | | Udaan | Pine Labs | Porter |
| Shiprocket | | Unacademy | Polygon | Postman |
| Skyroot Aerospace | | UpGrad | Porter | Pristyn Care |
| slice | | Upstox | Postman | Purplle |
| Spinny | | Vedantu | Pristyn Care | Raise Financial Services |
| Turtlemint | | Dailyhunt | Purplle | Rapido |
| Udaan | | Yubi (CredAvenue) | Raise Financial | Razorpay |
| Unacademy | | Zepto | Rapido | Rebel Foods |
| Uniphore | | Zetwerk | RazorPay | Sarvam AI |
| upGrad | | | Rebel Foods | ShareChat |
| Upstox | | | Rivigo | Shiprocket |
| Vedantu | | | ShareChat | Skyroot Aerospace |
| Xpressbees | | | Shiprocket | Slice |
| Yubi (CredAvenue) | | | Skyroot Aerospace | Spinny |
| Zenoti | | | Slice | Staq (Innovaccer) |
| Zeta | | | Spinny | Stashfin |
| Zetwerk | | | Udaan | Udaan |
| | | | Unacademy | Uniphore |
| | | | Uniphore | UpGrad |
| | | | upGrad | Upstox |
| | | | Upstox | Vedantu |
| | | | Urban Company | Whatfix |
| | | | Vedantu | Xpressbees |
| | | | Xpressbees | Yubi (CredAvenue) |
| | | | Yubi (CredAvenue) | Zenoti |
| | | | Zenoti | Zepto |
| | | | Zepto | Zeta |
| | | | Zerodha | Zetwerk |
| | | | Zeta | |
| | | | Zetwerk | |
| | | | Zoho | |

**Variations in the names of unicorns returned by the models have been normalized against the master list to allow cross-list comparison. For example, Krutrim and Krutrim SI have been renamed to Ola Krutrim. The raw and normalized data are available in the accompanying spreadsheet.
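The normalization step described in the note could look something like the sketch below. The `ALIASES` table and the `normalize` helper are illustrative assumptions, not the mapping actually used for this experiment; only the Krutrim example comes from the note, and the other entries are spelling variants visible in the comparison table.

```python
# Hypothetical alias table: keys are lowercased variants a model might return,
# values are the canonical master-list spellings. Illustrative only.
ALIASES = {
    "krutrim": "Ola Krutrim",
    "krutrim si": "Ola Krutrim",
    "razorpay": "Razorpay",   # catches "RazorPay"
    "livspace": "Livspace",   # catches "LivSpace"
}

def normalize(name: str) -> str:
    """Map a raw model-returned name to its canonical form, if one is known."""
    key = " ".join(name.split()).lower()  # collapse whitespace, ignore case
    return ALIASES.get(key, name.strip())

raw = ["Krutrim SI", "RazorPay", "  Zepto "]
print([normalize(n) for n in raw])  # ['Ola Krutrim', 'Razorpay', 'Zepto']
```

Names with no alias pass through unchanged (apart from trimmed whitespace), so anything the table does not cover still needs a manual check.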

Analysis of Results

The unicorn count alone is a misleading headline. A model can return a high number by padding its list with false positives, or a low number by being conservative. What matters is how many of the 97 verified unicorns each model actually identified - and how many entries it included that had no business being there. The table below adds that lens.

| Lab | Model | Unicorn Count | Accuracy vs Master List |
| --- | --- | --- | --- |
| Google Gemini | gemini3.5-flash | 86 | 82.5% (80/97) - returned 86 names, 6 were false positives |
| OpenAI | GPT-5.5-Instant | 64 | 49.5% (48/97) - returned 64 names, missed 49 real unicorns, 16 were false positives |
| Anthropic | Sonnet4.6 | 79 | 64.9% (63/97) - returned 79 names, missed 35 real unicorns, 16 were false positives |
| Manus | Manus 1.6 Lite | 99 | 83.5% (81/97) - returned 99 names, missed 16 real unicorns, 19 were false positives |
| Master List | - | 97 | 100% |
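For readers who want to reproduce this kind of scoring, here is a minimal sketch of the comparison using set operations. The `score` function and the toy lists are assumptions for illustration; they are not the real data or the exact procedure used for the table.

```python
def score(returned: list[str], master: list[str]) -> dict:
    """Compare a model's (already normalized) list against the master list."""
    returned_set, master_set = set(returned), set(master)
    hits = returned_set & master_set             # verified unicorns the model found
    missed = master_set - returned_set           # verified unicorns it did not return
    false_positives = returned_set - master_set  # names not on the master list
    return {
        "hits": len(hits),
        "missed": len(missed),
        "false_positives": len(false_positives),
        "accuracy_pct": round(100 * len(hits) / len(master_set), 1),
    }

# Toy data for illustration only, not the real lists.
master = ["Acko", "Apna", "BharatPe", "Zepto"]
returned = ["Acko", "Zepto", "Zoho"]
print(score(returned, master))
```

Because the inputs are converted to sets, a name that a model returns twice is counted once, so the raw length of a list and the sum of hits and false positives can differ.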

Missing Companies

When measured against the Master List (97 unicorns), identification gaps are more severe than the headline unicorn counts suggest. The "Accuracy" figure represents how many of the 97 verified unicorns each model actually identified - but the misses tell a sharper story:

  • OpenAI (GPT-5.5-Instant) missed 49 companies - just over half the master list. The omissions aren't just obscure names; they include well-established players like BharatPe, CoinDCX, PhonePe, Rebel Foods, Shiprocket, and Udaan. At 49.5% accuracy, this is by far the weakest recall of the four models tested. The gap between its stated count (64) and its actual verified hits (48) is stark - a quarter of what it returned had no business being on the list.
  • Anthropic (Sonnet 4.6) missed 35 companies despite appearing mid-table by count. It dropped well-known names like BoAt, InMobi, Postman, and Zeta, while simultaneously inflating its list with false positives. At 64.9% accuracy, the gap between its unicorn count (79) and verified hits (63) is the starkest precision-recall mismatch of any model - it substituted ineligible entries for real ones.
  • Google Gemini (gemini3.5-flash) missed 17 companies, achieving 82.5% accuracy - the second-best recall of the four. Its blind spots skewed toward newer or less-covered names: 5ire, CitiusTech, Druva, Khatabook, MoEngage, Navi, and Sarvam AI were all absent. The omissions feel like genuine knowledge gaps rather than structural failures.
  • Manus (Manus 1.6 Lite) achieved the best recall at 83.5%, missing only 16 companies - but its misses still included some notable ones: FarEye, MoEngage, Navi, PhonePe, Pocket FM, Sarvam AI, and Stashfin all slipped through despite Manus's active web-searching approach. No model got a clean sheet.

The "Extra" Problem (False Positives)

Comprehensiveness isn't just about recall - precision matters equally. Several models padded their lists with companies that a careful reading of the prompt should have excluded. Across all four models, the false positives fall into four distinct categories, each revealing a different kind of model failure.

Type 1 - Already Public - The prompt explicitly excluded companies that have gone public. Yet every model included at least some IPO graduates. Groww, Meesho, Ather Energy, BlueStone, FirstCry, PhysicsWallah, Urban Company, Fractal, etc. have all listed on public markets. This is arguably the most straightforward error - these exits are heavily covered events - and yet it was the most common category of mistake across all four labs.

Type 2 - Defunct, Acquired or Written Down - BYJU'S, Rivigo, and MyGlamm are companies whose unicorn status has effectively ceased to exist - through insolvency proceedings, operational shutdown, or deep valuation write-downs that have been extensively reported. Unacademy was recently acquired. Yet their continued appearance on model-generated lists suggests that models learn the "unicorn moment" of a company but struggle to unlearn it. Anthropic was the most exposed here, listing all three; Manus included two.

Type 3 - Corporate Subsidiaries, Not Startup Unicorns - JSW One Platforms, Reliance Jio, Reliance Retail, Biocon Biologics, Tata Passenger Electric Mobility, etc. - these are arms of large, established conglomerates - not exactly venture-backed startups that independently crossed the $1B valuation mark. The unicorn designation is specifically a startup construct, and including subsidiaries of the Reliance, Tata, JSW, or Mahindra groups reflects a category confusion that a well-calibrated model should avoid. Anthropic was most prone to this error, accounting for the majority of conglomerate entries across the test.

Type 4 - Bootstrapped and Proud of It - Zoho and Zerodha are special cases. They are valued well above $1B - but they have never raised external venture capital. Whether they "qualify" as unicorns is a definitional debate, but their inclusion in the lists of Manus and OpenAI suggests models are pattern-matching on valuation alone, ignoring the funding-lineage dimension of the term.
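One way to tally extras against these four categories is a simple lookup, sketched below. The `STATUS` table is a hypothetical reference built from examples named in the text; it is not a complete or authoritative classification, and anything not in it is reported as unclassified rather than guessed.

```python
from collections import Counter

# Hypothetical reference table using the four categories described above.
STATUS = {
    "Groww": "already_public",
    "Meesho": "already_public",
    "BYJU'S": "defunct_acquired_or_written_down",
    "Rivigo": "defunct_acquired_or_written_down",
    "Reliance Jio": "corporate_subsidiary",
    "JSW One Platforms": "corporate_subsidiary",
    "Zoho": "bootstrapped",
    "Zerodha": "bootstrapped",
}

def categorize(extras: list[str]) -> Counter:
    """Count false positives per category; unknown names stay 'unclassified'."""
    return Counter(STATUS.get(name, "unclassified") for name in extras)

# Toy input for illustration only.
print(categorize(["Groww", "Zoho", "Rivigo", "Turtlemint"]))
```

Keeping an explicit "unclassified" bucket makes it obvious which extras still need a human decision, such as borderline soonicorns.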

Breaking down the extras by lab:

  • Manus had the most extras (19), with the largest share being public companies (Amagi, Groww, Lenskart, Meesho, Urban Company) alongside two defunct entries (MyGlamm, Rivigo), two bootstrapped (Zerodha, Zoho), and one conglomerate arm (JSW One Platforms). More searches, it turns out, also surface more noise.
  • OpenAI added 16 extras - the broadest categorical spread of any model - including 11 public companies (Ather Energy, BlueStone, FirstCry, Fractal Analytics, Groww, Meesho, PhysicsWallah, Urban Company, Five Star Business Finance, Pine Labs, Swiggy Instamart), one bootstrapped (Zoho), one conglomerate (NSE), and one possible soonicorn (FreshToHome).
  • Anthropic Sonnet contributed 16 extras, concentrated in two categories: public companies (Groww, Lenskart, Meesho, PhysicsWallah) and conglomerate subsidiaries (Reliance Jio, Reliance Retail, Biocon Biologics, EV Co (Mahindra), Mahindra Electric Automobile, Tata Passenger Electric Mobility) - the latter being the most distinctive failure pattern of any model in this test. Also notable: two defunct companies, Snapdeal and BYJU'S, appear on its list.
  • Gemini had the fewest extras at 6, and its errors were the most forgivable: 2 public companies (Fractal, PhysicsWallah), one defunct (MyGlamm), and one conglomerate arm (JSW One Platforms). Turtlemint - a well-funded insurtech that has been on the cusp of unicorn status for some time - is the one genuine soonicorn in the mix. Gemini's low false positive rate, combined with its strong recall, makes it the most precise model in this test and essentially ties with Manus on overall accuracy (82.5% vs 83.5%) despite using no agentic web search.
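To put precision and recall side by side, the sketch below derives both from the figures reported in the accuracy table: names returned, verified hits, and the master list size of 97. The `precision_recall` helper is an illustrative assumption; precision here simply means verified hits divided by names returned.

```python
# (names returned, verified hits) per model, as reported in the accuracy table.
RESULTS = {
    "gemini3.5-flash": (86, 80),
    "GPT-5.5-Instant": (64, 48),
    "Sonnet4.6": (79, 63),
    "Manus 1.6 Lite": (99, 81),
}
MASTER_SIZE = 97

def precision_recall(returned: int, hits: int, master_size: int = MASTER_SIZE):
    """Return (precision %, recall %) rounded to one decimal place."""
    precision = 100 * hits / returned   # share of returned names that are verified
    recall = 100 * hits / master_size   # share of the master list that was found
    return round(precision, 1), round(recall, 1)

for model, (returned, hits) in RESULTS.items():
    p, r = precision_recall(returned, hits)
    print(f"{model}: precision {p}%, recall {r}%")
```

On the reported figures this gives Gemini a precision of 93.0% against 75.0% for GPT-5.5-Instant, which is the gap the bullet points describe in words.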

Patterns in LLM Failure

  1. Valuation Staleness: The presence of BYJU'S, Rivigo, and MyGlamm across multiple models is the clearest evidence of temporal lag. These are not obscure edge cases - their collapses were front-page news. Models appear to index the peak of a company's story far more reliably than its decline.
  2. Prompt Filtering Failures: Every model failed on at least one explicit exclusion criterion in the prompt - "gone public," "acquired," or "defunct." The models understood the task conceptually but couldn't apply the filters consistently against their own internal knowledge.
  3. The "Specialised SaaS" Blindspot: All four models showed higher consistency on consumer-facing names (Dream11, OYO, Razorpay) than on B2B or deep-tech unicorns. CitiusTech, FarEye, Icertis, Netradyne, and Sarvam AI were among the most frequently missed - suggesting these companies simply don't appear often enough in the training corpus to register reliably when the model is constructing a list from memory.
  4. The Agentic Tradeoff: Manus's web-search-augmented approach achieved the best raw recall but also generated the most false positives and naming artefacts. The more striking finding is that Gemini - using live search - matched Manus almost exactly on accuracy (82.5% vs 83.5%), while producing far fewer false positives.
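The "most frequently missed" observation in pattern 3 can be checked by counting, for each master-list company, how many models failed to return it. The sketch below uses toy data and a hypothetical `miss_frequency` helper; it shows the shape of the check, not the actual results.

```python
from collections import Counter

def miss_frequency(master: list[str], model_lists: dict[str, list[str]]) -> Counter:
    """Count, per master-list company, how many models did not return it."""
    counts = Counter()
    master_set = set(master)
    for returned in model_lists.values():
        counts.update(master_set - set(returned))  # this model's misses
    return counts

# Toy data for illustration only, not the real lists.
master = ["Dream11", "Razorpay", "Sarvam AI", "FarEye"]
model_lists = {
    "model_a": ["Dream11", "Razorpay"],
    "model_b": ["Dream11", "Razorpay", "FarEye"],
    "model_c": ["Dream11", "Sarvam AI"],
}
for name, n in miss_frequency(master, model_lists).most_common():
    print(f"{name}: missed by {n} of {len(model_lists)} models")
```

Companies missed by every model are the strongest candidates for a shared blind spot, while single-model misses say more about that model than about coverage in general.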
