Machine Learning: Stochastic Parrots that Mimic Social Biases
Machine Learning is based on processing enormous amounts of data, and that processing takes time. A common way to speed up machine learning while still achieving good results is a technique called ‘transfer learning’: rather than starting from scratch, a generic ‘pre-trained’, off-the-shelf model is plugged in as the starting point for a new NLP project.
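The idea can be sketched in a few lines of plain Python. This is a toy illustration, not a real NLP framework: the "pre-trained" word vectors below are invented for the example, standing in for the weights an off-the-shelf model would provide.

```python
# Hypothetical off-the-shelf word vectors, standing in for a real
# pre-trained model's learned weights.
PRETRAINED_VECTORS = {
    "good": [0.9, 0.1],
    "bad": [0.1, 0.9],
    "great": [0.8, 0.2],
}

def init_model(vocab, pretrained=None):
    """Initialize word vectors, reusing pre-trained ones where available."""
    model = {}
    for word in vocab:
        if pretrained and word in pretrained:
            # Transfer learning: start from weights the old model already learned.
            model[word] = list(pretrained[word])
        else:
            # From scratch: an uninformative starting point that needs training.
            model[word] = [0.5, 0.5]
    return model

# Only words the pre-trained model has never seen need to be learned anew.
model = init_model(["good", "bad", "novel_word"], pretrained=PRETRAINED_VECTORS)
print(model["good"])        # transferred from the pre-trained model
print(model["novel_word"])  # initialized from scratch
```

The speed-up comes from the transferred entries: the new project inherits whatever the original model learned, for better or worse, which is exactly why biases in the pre-trained data carry over.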
A problem with using canned models is that the output of Machine Learning depends entirely on the data it was trained on, and until very recently little or no consideration was given to whether the training data contain biases. Many of the common pre-trained text datasets used for natural language ML models were collected from very narrow populations. For example, GPT-2 was trained on text from pages linked in Reddit posts, and Reddit's user base skews toward young males.
Text output from NLP models strongly reflects the content and biases embodied in the training data. As a result, some have described NLP models as Stochastic Parrots — software that mimics the content, and the biases, of the material that trained it.
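The parroting effect can be demonstrated with a toy bigram model, which can only recombine word pairs seen in training. The tiny corpus below is invented for illustration; note how its skewed pairing of "engineer" with "he" reappears in the generated text, even when the prompt is a different word.

```python
from collections import Counter, defaultdict

# An invented, deliberately skewed training corpus.
corpus = (
    "the engineer said he would fix it . "
    "the engineer said he was busy . "
    "the nurse said she would help ."
).split()

# Count word-to-next-word transitions (a bigram model).
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def parrot(word, length=4):
    """Generate text by always following the most frequent transition."""
    out = [word]
    for _ in range(length):
        if word not in transitions:
            break
        word = transitions[word].most_common(1)[0][0]
        out.append(word)
    return " ".join(out)

# "said" is followed by "he" twice and "she" once in the corpus, so the
# model's most likely continuation parrots that imbalance back.
print(parrot("engineer"))
print(parrot("nurse"))
```

Even prompted with "nurse", the model's most likely continuation is "said he", because that is the majority association in its training data. Large language models are vastly more sophisticated, but the underlying dependence on the statistics of the training corpus is the same.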
Why is it important to recognize that NLP models often just repackage the content they were trained on? Because these models will ultimately perpetuate the same biases and perspectives found in the original training data. Jennifer Redmon, data evangelist at Cisco, wrote for Forbes that “if bigotry isn’t parroted to an individual by a language model, that individual is much less likely to become a bigot. If empathy, diversity, equity, justice, inclusion and respect are parroted by a language model, the same person is more likely to embody these human values.”
Esther Sánchez García and Michael Gasser, in an issue of Science for the People, commented that “an immediate action for NLP researchers is to cautiously consider the tradeoff between the energy required and expected benefits. Datasets and the systems trained on them must be carefully planned and documented, taking into account not simply the design of the technology, but also the people who will use it or may be affected by it. Rather than simply grabbing massive quantities of available data from the Internet, the researchers will need to dedicate considerable time to creating datasets that are appropriate for the tasks and, to the extent possible, free from the biases that are present in unfiltered data.”