Better algorithms and open data frameworks needed for inclusive and diverse online content
By: Arthur Gwagwa
Because of their unprecedented ability to learn continually from collected data, artificial intelligence (AI) technologies are transforming the publication of online content, for better or for worse. Today, search engine algorithms decide the results of our searches (Nair, 2018), while AI-driven apps like Inkitt connect readers to authors. The effectiveness of such algorithms in mediating online content depends largely on a combination of better algorithms, comprehensive analytics, data availability, and repetition based on reinforcement learning. Although African bloggers and content creators have benefitted from using multimedia digital platforms to create and distribute their work, cyberspace, and the internet in particular, is still mostly dominated by the Global North narrative. This is because Global North engineers largely design the algorithms that analyze, harvest, index, present and repurpose online data and the applications that run the computers, whilst Global North data scientists decide which training datasets to use in training the algorithms. The interaction between people and machines has also led to systems (social machines) that blur the lines between computational processes and human input, with a direct impact on the publication of online content. Because of unequal access and connectivity to the internet, most of the data that feeds into computer analytics is also generated in the Global North. Coupled with data collection and distribution frameworks that are not built towards healthy partnerships between industry and government, North-centric search and sorting algorithms may prevent Global South countries from realizing the potential in their data. At a technical level, improved algorithms are needed for inclusive and diverse online content; equally important, at a regulatory level, are more open data frameworks.
Imperfect but intelligent machines as arbiters of online content
As ‘smart machines’ increasingly pervade almost every aspect of human existence, computational processes have increasingly replaced human agency. For instance, machine learning algorithms now classify knowledge through categorization and representation schemes, information retrieval, recommendation and classification, thus creating their own culture. Although this has led to more efficient public decision making and implementation, such systems have also produced and reinforced discriminatory patterns in content inclusion and distribution. This may be unintentional; for example, it may be occasioned by the non-linear variable selection process in the algorithm design and by the output verification process. Relevant real-life instances of discrimination include:
- Algorithms pushing a particular type of content to a certain class of online readers and clients, as in the recent case of the Netflix algorithm matching black customers with African content (Guardian, 2018).
- The discriminatory effect of word embeddings, a class of natural language processing techniques that enables machines to use human language sensibly and that is quite effective at absorbing the accepted societal meaning of words, biases included (Caliskan et al., 2017).
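To make the word embedding point concrete, the sketch below shows one simple way an association between a group term and "pleasant" or "unpleasant" terms could be measured. It is only loosely inspired by the kind of test reported by Caliskan et al. (2017) and is not their full method. The three-dimensional vectors, the labels `group_x` and `group_y`, and the function names are all invented for illustration; real embeddings have hundreds of dimensions and are learned from large text corpora.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(word_vec, attr_a, attr_b):
    """Mean similarity to attribute set A minus mean similarity to set B."""
    mean_a = sum(cosine(word_vec, a) for a in attr_a) / len(attr_a)
    mean_b = sum(cosine(word_vec, b) for b in attr_b) / len(attr_b)
    return mean_a - mean_b

# Hypothetical toy vectors, hand-made for this example (not real embeddings).
toy_vectors = {
    "group_x": [0.9, 0.1, 0.2],
    "group_y": [0.1, 0.9, 0.2],
    "pleasant": [0.8, 0.2, 0.1],
    "unpleasant": [0.2, 0.8, 0.1],
}

pleasant = [toy_vectors["pleasant"]]
unpleasant = [toy_vectors["unpleasant"]]

# A positive score means the term sits closer to "pleasant" than "unpleasant".
score_x = association(toy_vectors["group_x"], pleasant, unpleasant)
score_y = association(toy_vectors["group_y"], pleasant, unpleasant)

print(f"group_x association: {score_x:+.3f}")
print(f"group_y association: {score_y:+.3f}")
```

In this toy setting the two group terms receive scores of opposite sign purely because of where their vectors happen to sit. An embedding trained on text that carries societal stereotypes can encode the same kind of asymmetry, and any system built on top of it may inherit it.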
The non-linear variable selection process and the output verification process are in part a result of the way the design of machine learning algorithms has evolved. During the introductory phases of machine learning, programmers would instruct these machines on exactly what to do, but now they simply give them data and a general set of instructions for learning from it (bottom-up as opposed to top-down machine learning algorithms). This makes the machines responsible for decisions, including as arbiters of online content. Such an approach lacks transparency and accountability, and the problem may be reinforced if such algorithms are incorporated into already-opaque governance structures. As Piwowar (2018) puts it, another challenge is algorithmic opacity, “understood as the inability to audit algorithms, including in-depth inspection of data inputs, general algorithm design, as well as output data, in conjunction with companies’ trade secret”. Piwowar adds that a further issue is how to communicate effectively when algorithms are used, for what purposes and with what effect(s). Further, although engineers have tried to replicate human intelligence in machines, AI systems are not confined to methods that are biologically observable and will need another two decades of development, as the current algorithms lack “intuition” (Stanford University, 2018). This view was also echoed in the article “Why computers shouldn’t teach calculus”.
Interactions between individuals, technologies and data/information (The ‘Social Machine’)
Algorithms are created through training datasets and can only be as good as those datasets; in other words, they are a direct reflection of the data they are trained on. In many cases, training an AI algorithm requires a large amount of data and many repetitions. The most influential corporations in this sphere, economic agents like Amazon, Apple, Microsoft, Google, Facebook, and Baidu, wield extraordinary power from a distance. Take China, for example, and the sheer scope of the data generated by Chinese tech giants. Think of how much data Facebook collects from its users and how that data powers the company’s algorithms; now consider that Tencent’s popular WeChat app is basically like Facebook, Twitter, and your online bank account all rolled into one. China has roughly three times as many mobile phone users as the US, and those phone users spend nearly 50 times as much via mobile payments. China is, as The Economist first put it, the Saudi Arabia of data.
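The claim that an algorithm is a direct reflection of its training data can be illustrated with a deliberately simple sketch. The interaction log, the region labels and the function names below are hypothetical, and a popularity count stands in for a far more complex real-world recommender. The aim is only to show how a skew in who generates the data carries through to what gets recommended.

```python
from collections import Counter

# Hypothetical interaction log: (region of the user, region of the content clicked).
# 90 of the 100 interactions come from users in the "north".
training_log = (
    [("north", "north")] * 85
    + [("north", "south")] * 5
    + [("south", "south")] * 7
    + [("south", "north")] * 3
)

def train_popularity_ranker(log):
    """Count clicks per content region; the 'model' is just these counts."""
    return Counter(content_region for _, content_region in log)

def recommend(model, k=1):
    """Return the k most-clicked content regions, most popular first."""
    return [region for region, _ in model.most_common(k)]

model = train_popularity_ranker(training_log)
top = recommend(model)

# Share of all training clicks that went to southern content.
share_south = model["south"] / sum(model.values())

# What southern users themselves clicked on, taken from the same log.
south_clicks = [content for user, content in training_log if user == "south"]
south_preference = south_clicks.count("south") / len(south_clicks)

print("Top recommendation for everyone:", top)
print(f"Share of training clicks on southern content: {share_south:.2f}")
print(f"Share of southern users' clicks on southern content: {south_preference:.2f}")
```

In this toy log, southern users mostly click on southern content, yet the ranker trained on the pooled data recommends northern content to everyone, simply because northern users produced most of the log.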
As David Kaye (2018) recently observed, “Tech giants develop rules, standards, and guidelines, often in Silicon Valley, to determine for people around the world the appropriate boundaries of expression. In many places, American companies provide the dominant source of news and information, having an enormous impact on public life. Much as they may try, they are often out of touch with local and national concerns in the places where they operate.”
Therefore, algorithmic determination of knowledge can be traced to decisions made by individuals and groups of individuals operating within particular local, linguistic, regional, religious and bureaucratic cultures. The datasets used for training, decision-making and implementation, as well as the algorithmic determination of knowledge, may therefore reflect societal biases. For instance, there is currently a North-centric sensibility in the creation and training of algorithms, which dominates the larger computational world, whereby Global North culture has been the ‘authoritative principle’ operative in and around algorithmic culture (IEEE P7003 draft standard on culture). This is also reflected in the online content that internet users are accessing.
Algorithmic accountability and transparency
In light of the above, there have been calls for technology and data fairness for the Global South. On one level, the solution could be found in human-centred design or usability that balances security, privacy and transparency. However, as the use of big data by public institutions is increasingly shaping people’s lives (Vosloo, 2018), the issue extends beyond protecting user data and privacy to the transparency and comprehension of big data (ICTworks, 2018). ICTworks suggests that, in order to demonstrate a commitment to being transparent and accountable for the data they collect, organisations that mine big data need to become interpreters of their algorithms. Someone on their data science team needs to be able to explain the math to the public. Data visualizers and data storytellers should tell the story behind that data: a “how we got here” explanation.
The creators and arbiters of data, that is, organisations that use third-party big data analysis, should actively ask where the data comes from and what steps were taken to audit it for inherent bias, as part of the chain of demanding algorithmic accountability (ICTworks, 2018).
However, knowing where the data originated requires Global South countries to adopt the Open Contracting Data Standard (OCDS), which enables disclosure of data and documents at all stages of the contracting process by defining a common data model. The model was created to help organizations increase contracting transparency and to allow deeper analysis of contracting data by a wide range of users. At the moment, many countries in the Global South are not being given the necessary access to their own data, which stays hidden under contract rules, so that citizens cannot access it and therefore cannot benefit from it. The absence of regulations that mandate equal access to collected data will likely prolong the current mismatch between the pace of data collection among big established companies and that among small, new, and local businesses.
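As a rough illustration of what disclosure "at all stages of the contracting process" could mean in practice, the sketch below checks a contracting record for the stages it actually discloses. The stage names, the record layout and the function names are simplified assumptions made for this example and are not the official OCDS schema, which should be consulted for the actual field names and structure.

```python
# Simplified stage names, assumed for illustration only; the official OCDS
# schema should be checked for the real structure and field names.
STAGES = ["planning", "tender", "award", "contract", "implementation"]

def disclosed_stages(record):
    """Return the stages for which the record contains any information."""
    return [stage for stage in STAGES if record.get(stage)]

def missing_stages(record):
    """Return the stages that are absent or empty in the record."""
    return [stage for stage in STAGES if not record.get(stage)]

# Hypothetical record, invented for this example.
example_record = {
    "id": "example-contract-001",
    "planning": {"rationale": "Rural connectivity programme"},
    "tender": {"title": "Mobile broadband rollout"},
    "award": {},       # present but empty: treated as undisclosed
    "contract": None,  # explicitly withheld
}

print("Disclosed:", disclosed_stages(example_record))
print("Missing:  ", missing_stages(example_record))
```

A check of this kind only becomes possible once contracting data is published in a common, machine-readable model; where contracts stay hidden, there is nothing for citizens or researchers to run it against.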
Equally important in the distribution of content is the fact that the vast majority of social media platforms act like silos. APIs play an important role in corporate business models, in which the industry controls the data it collects without rewarding users, let alone offering them transparency. Negotiating the specification of APIs so as to make data a common resource should be considered, since such an effort may align with citizens’ interests (Cordova, 2018).
Free flow of non-personal data
Open contracting could be augmented by the free flow of non-personal data across the region. For example, the European Union recently ended data localisation requirements within the Member States by adopting a Regulation on the free flow of non-personal data, which the European Commission proposed in September 2017. This regulation adds a key pillar to the Digital Single Market, which is meant to facilitate a digital economy and society.
The opening up of data through open contracting arrangements is also seen at local levels, despite the competing values inherent in data stewardship. For instance, some universities are imposing open access requirements, whereby researchers must provide access to their data as a condition of obtaining grant funding or publishing results in journals (Borgman, 2018).
Inclusive algorithm design
Algorithmic accountability in the context of open contracting becomes even more necessary since big international corporations such as Facebook have been signing secretive contracts with Global South governments and local operators. This has led private sector platforms like Facebook, Google, and Twitter to become primary sources of information and vehicles for expression; they effectively function as the public square for civic engagement. Their algorithms affect their users’ access to information and how they form political opinions. This has created conceptual confusion about the roles and responsibilities of social media platforms in democracy (David Kaye, 2018).
As a first step towards data fairness, these corporations need to involve local communities in providing input into the training data, and social media companies need to involve such communities in governing their platforms. According to David Kaye (2018), they could take steps like diversifying leadership, enabling greater local content moderation that is not outsourced to contractors, and engaging deeply with the communities where they operate.
If the companies cannot make these kinds of changes, they need to explore how they could design algorithms that reflect the diversity of the regions where they operate or, in the case of social media platforms, spin off national versions of their platforms (David Kaye, 2018).
Paper written for HIVOS’s Africa Content Creators’ Summit, Nairobi, Kenya, December 2018, by Arthur Gwagwa, Research Fellow, CIPIT, Strathmore University, and Dr. Ansgar Koene, Senior Research Fellow at the Horizon Digital Economy Research Institute, University of Nottingham.