Last week, our lead software engineer, Nelson Masuki, and I presented at the MSRA Annual Conference to a room full of brilliant researchers, data scientists, and development practitioners from across Kenya and Africa. We were there to address a quietly growing dilemma in our field: the rise of synthetic data and its implications for the future of research, particularly in the regions we serve.
Our presentation was anchored in findings from our whitepaper, which compared data from a traditional CATI survey with synthetic outputs generated using several large language models (LLMs). The session drew a mixture of curiosity, concern, and critical thinking, especially when we demonstrated how off-the-mark synthetic data can be in places where cultural context, language, or ground realities are complex and rapidly changing.
We started the presentation by asking everyone to prompt their favorite AI app with some actual questions to model survey results. No two people in the room got the same answers, even though the prompt was exactly the same and many people used the same apps running the same models. Issue number one.
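For anyone who wants to reproduce that room exercise programmatically, here is a minimal sketch, assuming the OpenAI Python client and an illustrative model and prompt (none of which are from our study): it sends one identical prompt ten times and tallies the answers.

```python
# A minimal sketch of the room exercise: send the *same* survey-style prompt
# to the same model several times and tally the answers. The model name,
# prompt, and choice of the OpenAI client are illustrative assumptions.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Estimate the percentage of adults in Nairobi who used mobile money "
    "in the past 30 days. Reply with a single number only."
)

answers = []
for _ in range(10):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT}],
    )
    answers.append(resp.choices[0].message.content.strip())

# If the model were a reliable survey instrument, this would print one answer.
print(Counter(answers))
```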
The Experiment
We then presented the findings from our experiments. Starting with a CATI survey of over 1,000 respondents in Kenya, we conducted a 25-minute study covering several areas: food consumption, media and technology use, knowledge of and attitudes toward AI, and views on humanitarian aid. We then took the respondents’ demographic information (age, gender, rural-urban setting, education level, and ADM1 location), created synthetic data respondents (SDRs) that exactly matched those respondents, and administered the same questionnaire across multiple LLM providers and models (we even ran repeat cycles with newer, more advanced models). The differences were as varied as they were skewed – almost always wrong. Synthetic data failed the one true test of accuracy: the authentic voice of the people.
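To make the setup concrete, the sketch below shows one plausible way to assemble a demographically matched SDR prompt; the Respondent fields and persona wording are our illustrative assumptions, not the pipeline used in the whitepaper.

```python
# A simplified sketch of how a synthetic data respondent (SDR) prompt could
# be built from a real respondent's demographic profile. Field names and
# persona wording are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Respondent:
    age: int
    gender: str
    setting: str    # "rural" or "urban"
    education: str
    adm1: str       # first-level administrative area, e.g. a county

def sdr_prompt(r: Respondent, question: str) -> str:
    """Frame a survey question as if answered by a matched persona."""
    persona = (
        f"You are a {r.age}-year-old {r.gender} living in a {r.setting} "
        f"part of {r.adm1}, Kenya, with {r.education} education."
    )
    return f"{persona}\nAnswer the following survey question as that person:\n{question}"

print(sdr_prompt(
    Respondent(34, "woman", "rural", "secondary", "Kakamega"),
    "How many days in the past week did you eat three meals?",
))
```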
Many in the room had faced the same pressures: global funding cuts, increasing demands for speed, and now the allure of AI-generated insights that promise to be “just as good” without anyone ever leaving a desk. But for those of us grounded in the realities of Africa, Asia, and Latin America, the idea of simulating the truth, of replacing real people with probabilistic patterns, doesn’t sit right.
This conversation, and others we had throughout the conference, affirmed a growing truth: AI will undoubtedly shape the future of research, but it should not replace real human input. At least not yet, and not in the parts of the world where truth on the ground doesn’t live in neatly labeled datasets. We cannot model what we have never measured.
Why Synthetic Data Can’t Replace Reality – Yet
Synthetic data is exactly what it sounds like: data that hasn’t been collected from real people, but generated algorithmically based on what models assume the answers should be. In the research world, this usually involves creating simulated survey responses based on patterns identified from historical data, statistical models, or large language models (LLMs). While synthetic data can serve as a useful testing tool, and we are continually testing its utility in controlled experiments, it still falls short in several critical areas: it lacks ground truth, it misses nuance and context, and it is therefore hard to trust.
And that’s exactly the issue.
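To ground the definition, here is a minimal illustration of the pattern-based flavor of synthetic data, using an invented question and invented answer frequencies: the simulated answers look plausible, but they contain nothing beyond the historical patterns they were sampled from.

```python
# A minimal illustration of pattern-based synthetic data: draw simulated
# answers from the answer frequencies of a past survey. The question and
# frequencies are invented for illustration.
import random

# "How often do you listen to the radio?" (hypothetical historical shares)
historical = {"daily": 0.52, "weekly": 0.31, "rarely": 0.17}

def synthetic_answer() -> str:
    options, weights = zip(*historical.items())
    return random.choices(options, weights=weights, k=1)[0]

random.seed(1)
simulated = [synthetic_answer() for _ in range(5)]
print(simulated)  # plausible-looking answers, but no new information in them
```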
In our side-by-side comparison of real survey responses and synthetic responses generated through LLMs, the differences weren’t subtle – they were foundational. The models guessed wrong on major indicators like unemployment levels, digital platform usage, and even simple household demographics.
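One way to quantify that kind of gap for a single categorical question is sketched below; the counts are placeholders rather than our published results, and total variation distance is just one of several distance measures a researcher might pick.

```python
# A small sketch of a side-by-side check: compare the answer distribution
# from real respondents against the one produced by SDRs for one categorical
# question. Counts are placeholders, not results from the whitepaper.
def distribution(counts: dict[str, int]) -> dict[str, float]:
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

real = distribution({"employed": 410, "unemployed": 350, "informal": 240})
synthetic = distribution({"employed": 690, "unemployed": 180, "informal": 130})

print(f"Total variation distance: {total_variation(real, synthetic):.2f}")
# 0.0 means identical distributions; values near 1.0 mean almost no overlap.
```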
I don’t believe this is just a statistical issue. It’s a context issue. In regions such as Africa, Asia, and Latin America, ground realities change rapidly. Behaviors, opinions, and access to services are highly local and deeply tied to culture, infrastructure, and lived experience. These are not things a language model trained predominantly on Western internet content can intuit.
Synthetic Data Can, Indeed, Be Used
Synthetic data isn’t inherently bad. Lest you think we’re anti-tech (which we can never be accused of being), at GeoPoll we do use synthetic data, just not as a substitute for real research. We use it to test survey logic and optimize scripts before fieldwork, to simulate potential outcomes and spot logical contradictions in surveys, and to experiment with framing by running parallel simulations before data collection.
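As a toy example of the first of those uses, the sketch below runs randomly generated answers through a survey’s skip logic to surface contradictions before fieldwork; the questions and the deliberately planted bug are invented for illustration.

```python
# A toy sketch of pre-fieldwork logic testing: run randomly generated
# answers through a survey's skip logic to catch contradictions early.
# The questions and the planted bug are invented for illustration.
import random

def simulate_respondent() -> dict[str, object]:
    owns_phone = random.choice([True, False])
    return {
        "owns_phone": owns_phone,
        # Planted bug to catch: hours_online should be skipped for
        # non-owners, but this script asks it unconditionally.
        "hours_online": random.randint(0, 12),
    }

def check_logic(answers: dict[str, object]) -> list[str]:
    errors = []
    if not answers["owns_phone"] and answers.get("hours_online", 0) > 0:
        errors.append("hours_online answered by a respondent with no phone")
    return errors

random.seed(7)
for i in range(1000):
    problems = check_logic(simulate_respondent())
    if problems:
        print(f"simulated respondent {i}: {problems}")
        break
```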
And yes, we could generate synthetic datasets from scratch. With more than 50 million completed surveys across emerging markets, our dataset is arguably one of the most representative foundations for localized modeling.
However, we have also tested its limits, and the findings are clear: synthetic data cannot replace real, human-sourced insights in low-data environments. We don’t believe it is ethical or accurate to replace fieldwork with simulations, especially when decisions about policy, funding, or aid are at stake. Synthetic data has its place, but in our view it is not, and should not be, a shortcut to understanding real people in underrepresented regions. It is a tool to extend research, not a replacement for it.
Data Equity Starts with Inclusion – GeoPoll AI Data Streams
There’s a significant reason this matters. While some are racing to build the next large language model (LLM), few are asking: What data are these models trained on? And who gets represented in those datasets?
GeoPoll is in this space, too. We now work with tech companies and research institutions to provide high-quality, consented data from underrepresented languages and regions – data used to train and fine-tune LLMs. GeoPoll AI Data Streams is designed to fill the gaps where global datasets fall short – to help build more inclusive, representative, and accurate LLMs that understand the contexts they seek to serve.
Because if AI is going to be truly global, it needs to learn from the entire globe, not just guess. We must ensure that the voices of real people, especially in emerging markets, shape both the decisions and the technologies of tomorrow.
Contact us to learn more about GeoPoll AI Data Streams and how we use AI to power research.