Stochastic Parrots

Paul Taylor

As chest X-rays of Covid-19 patients began to be published in radiology journals, AI researchers put together an online database of the images and started experimenting with algorithms that could distinguish between them and other X-rays. Early results were astonishingly successful, but disappointment soon followed. The algorithms were responding not to signs of the disease, but to minor technical differences between the two sets of images, which were sourced from different hospitals: such things as the way the images were labelled, or how the patient was positioned in the scanner.
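The failure mode described here can be reproduced in miniature. Below is a minimal sketch with entirely made-up data: 'images' that are pure noise, except that one source adds a bright corner pixel (standing in for a hospital's label burn-in). A plain logistic regression, trained when source and diagnosis coincide, learns only the marker; swap the marker between classes at test time and the model collapses. All names and numbers are illustrative, not from the studies Taylor describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_images(n, marker_on, label):
    # 8x8 'X-rays' flattened to 64 pixels: pure noise, no disease signal at all
    X = rng.normal(size=(n, 64))
    if marker_on:
        X[:, 0] += 4.0  # corner pixel brightened, e.g. a hospital's label burn-in
    y = np.full(n, label)
    return X, y

# Training data confounds source with diagnosis: all 'covid' images come from
# hospital A (marker on), all 'non-covid' from hospital B (marker off)
Xa, ya = make_images(200, True, 1)
Xb, yb = make_images(200, False, 0)
X = np.vstack([Xa, Xb])
y = np.concatenate([ya, yb])

# Plain logistic regression trained by gradient descent
w = np.zeros(64)
b = 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))  # predicted probability of 'covid'
    g = p - y                           # gradient of the log loss
    w -= 0.1 * X.T @ g / len(y)
    b -= 0.1 * g.mean()

def accuracy(X, y):
    return float((((X @ w + b) > 0) == (y == 1)).mean())

print(accuracy(X, y))  # near-perfect: the marker alone separates the classes

# Swap sources at test time: 'covid' images now lack the marker
Xc, yc = make_images(200, False, 1)
Xd, yd = make_images(200, True, 0)
Xt = np.vstack([Xc, Xd])
yt = np.concatenate([yc, yd])
print(accuracy(Xt, yt))  # collapses: the model learned the marker, not the disease
```

Since the images contain no disease signal whatever, the only thing the model can learn is the artifact; the point of the sketch is that high accuracy on confounded data proves nothing about what has been learned.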

It’s a common problem in AI. We often refer to ‘deep’ machine learning because we think of the calculations as being organised in layers and we now use many more layers than we used to, but what is learned is nevertheless superficial. If a group announces that their algorithm can infer sexual orientation from photographs of faces, it may well be responding to differences between the ways people present themselves on dating sites. If a start-up boasts that it can identify criminality from photographs, it’s worth asking if it’s merely sorting police mug shots from Instagram selfies, or, worse, telling us that people with certain skin tones are more likely to get convicted.

Awareness of these problems, and their social consequences, has been growing. In 2019, an algorithm used to allocate healthcare resources in the US was found to be less likely to recommend preventative measures if the patient was black, because the algorithm was optimised to save costs and less money is spent treating black patients. Around the same time, Timnit Gebru, a leader of Google’s ‘ethical AI’ team and one of the few black women in a prominent role in the industry, demonstrated that commercially available face recognition algorithms are less effective for women, black people and, especially, black women, because they are under-represented in the data the algorithms are trained on.

From the perspective of an individual researcher, the solution to these problems may be to try harder: to use data that is inclusive and metrics that aren’t discriminatory, to make sure that you understand, as best you can, what the algorithm is learning, and that it isn’t amplifying an existing injustice. To some extent, though, the problems are structural. One reason we don’t pay enough attention to the consequences that our algorithms have for women or ethnic minorities is that so few women and so few people of colour work in tech. Groups like Black in AI, co-founded by Gebru, have been set up to try to improve the situation, but the barriers are significant.

AI’s problem with fairness is structural in another way. Much of the work done by small teams builds on huge datasets that were created by large collaborations or corporations. ImageNet, part-funded by Google, contains the URLs of 14 million images allocated, by anonymous online workers, to more than 20,000 categories. Training algorithms to replicate this classification has been a key challenge in AI, and has done much to transform the field. Many algorithms developed for more specialist tasks take generic networks already trained on ImageNet as their starting point. Most of this research ignores the 2833 categories that deal with people, and it’s easy to see why: the four most populated categories are ‘gal’, ‘grandfather’, ‘dad’ and ‘chief executive officer’; a 2020 audit by Abeba Birhane and Vinay Prabhu concluded that 1593 categories used ‘potentially offensive’ labels. They also report finding pornographic and non-consensual images in the collection. ‘Feeding AI systems on the world’s beauty, ugliness and cruelty,’ they write, ‘but expecting it to reflect only the beauty, is a fantasy.’

Perhaps more troubling than ImageNet is the development of large-scale language models, such as GPT-3, generated from petabytes of data harvested from the web. The scale of the model is incredible and its capacities are bewildering: one short video shows how to use it to create a kind of virtual accountant, a tool that, given half a dozen sentences describing a business, will generate a working spreadsheet for its transactions.

Gebru helped write a paper last year, ‘On the Dangers of Stochastic Parrots’, which argues, among other things, that much of the text mined to build GPT-3 comes from forums where the voices of women, older people and marginalised groups are under-represented, and that these models will inevitably encode biases that will affect the decisions of the systems built on top of them.

The authors of ‘On the Dangers of Stochastic Parrots’ advocate ‘value sensitive design’: researchers should involve stakeholders early in the process and work with carefully curated, and smaller, datasets. The paper argues that the dominant paradigm in AI is fundamentally broken. Their prescription is not state regulation or better algorithms but, in effect, a more ethically grounded way of working. It is hard to see how this can be brought about while the field is dominated by large, ruthless corporations, and recent events give little ground for optimism.

Google had already circulated a memo calling on its researchers to ‘strike a positive tone’ in discussions of the company’s technology. A pre-publication check of the stochastic parrots paper seems to have alarmed the management. They suggested changes, but Gebru stood her ground and, according to Google, resigned in December. By Gebru’s account, she was sacked. Shortly afterwards her colleague Margaret Mitchell was suspended, allegedly for running scripts to search emails for evidence of discriminatory treatment of Gebru. One of the authors on the published preprint of the paper is named as ‘Shmargaret Shmitchell’ and an acknowledgment notes that some authors were required by their employers to remove their names.

There has been a response. More than a thousand Google employees have signed an open letter calling on the company to explain its treatment of Gebru. Workers have formed a trade union, at least partly in response to these events. A director and a software developer have resigned.


  • 13 February 2021 at 6:33pm
    Graucho says:
    There is an interesting trade-off in predictive modelling called Breiman's law. It runs along the lines of Interpretability × Accuracy = Breiman's constant. In short, the more accurate the predictions, the blacker the black box that produced them will be, and the harder it will be to fathom why the algorithm produced the results that it did. This created issues even in the early days with simple AI applications such as credit rating: lenders had to be able to demonstrate that credit had not been refused on the grounds of race, sex, religion etc., so they had to know why the algorithm had made the decision it did. The ability of AI to inadvertently deprive citizens of their civil rights is there and growing. Cases challenging "Computer says no" should prove a nice little earner for the legal profession in the coming decade.

  • 14 February 2021 at 1:46pm
    M.G. Zimeta says:
    A helpful overview. The open letter is available here, and non-Googlers can sign it too. It calls for more transparency from Google, and a strengthened commitment to research integrity and academic freedom.

  • 17 February 2021 at 12:09am
    David Roser says:
    Thanks for the links to the questioning of current AI. I just love the metaphor 'stochastic parrots' because it captures a long-standing flaw I continually see in current AI: the overemphasis on extreme empiricism in the absence of underlying theory. To be sure, empirical methods are proving very useful, e.g. in medical imagery interpretation, and can identify previously unappreciated potential. But this method seems to miss making links to how things really work in the world, beyond human stereotyping. The following provide some views on the downsides of overemphasizing empirical AI. The seminal, Pulitzer-winning AI book Gödel, Escher, Bach was reviewed in the LRB forty years ago. Intriguingly, the reviewer missed the deeper speculation of the book's last four chapters, and Hofstadter's interest not just in creating AI but in understanding the nature of intelligence itself. (As an aside, that reviewer, though eminent in computing, is today better known for his contribution to what became known as the AI winter.) A more general warning against empiricism is provided by a one-page mea culpa by Freeman Dyson: 'A meeting with Enrico Fermi', Nature 427 (2004), p. 297. This is a delightful story of hubris followed by humility, reflecting seduction by seemingly impressive empirical prediction which proved to be based on a poor understanding of physical reality. Finally there are the lessons provided by a different approach to AI, Bayesian belief nets; cf. Korb, K.B. and Nicholson, A.E., Bayesian Artificial Intelligence (CRC Press, 2011). The term 'Bayes net' was coined by Judea Pearl, who received the 2011 Turing Award partly for this work. Though this approach, like neural nets, also employs hidden/latent nodes, its philosophy leads to not one but two approaches to essentially 'AI'-based inference (prediction, backcasting, diagnosis etc.). On one hand there are 'semi/naive' BNs which 'learn' empirically from large data sets guided by machine optimization.
    But there are also belief nets where the user guides network construction based on 'belief'. Belief may be the problematic expert opinion, but it can also be algorithms reflecting well-researched hard science, and hence biophysical reality as far as we can understand it. This may be less helpful in the social sciences and humanities, where complexity continues to defeat systematization and generates a bag of contradictory roosters, e.g. in areas like economics which cloak value judgements with a veneer of mathematics. But it does seem to offer a way by which better hypotheses in these fields may be generated and tested.

  • 20 February 2021 at 3:40pm
    brummagem galenus says:
    I note Google have terminated Dr Mitchell's employment.

    “Don’t be evil, and if you see something that you think isn’t right – speak up!”