Thank you for taking the time to read this. I hope you walk away with a new perspective on how we use everyday technology. If you enjoyed this post, please do share it with your network.
Quick Note: I hope you like the new banner! I also have a new logo, go check it out on the blog or the podcast.
Recently, I wrote about the ethical implications of using AI from an environmental standpoint. Today I want to explore another part of ethics in AI - the data we use to train it.
AI and machine learning can be a black box - you don't fully understand what goes into it, or what happens inside it, but you see the output one way or another. We're told to trust it because people more clever than us tell us to. But what you need to remember is that anything created by humans has the limitations of thinking like humans - this means that AI models inherit all the biases and problems that come with human thought, which is hugely influenced by forces beyond our control, like economics, politics and philosophy.
One can argue that creating robots to think like us might be fundamentally flawed, but if we know what our biases and shortcomings are, we can at least adjust our methodology and workings to address some of the issues.
We are riddled with unconscious bias
From conversations with many people from different walks of life and backgrounds, I’ve discovered that the definition of ‘bias’ can differ from person to person (i.e. people don’t always like to admit they have it!). For the sake of this article, I will baseline the definition from this source:
Bias is a prejudice in favor of or against one thing, person, or group compared with another usually in a way that’s considered to be unfair. Biases may be held by an individual, group, or institution and can have negative or positive consequences.
I really don’t believe there is anyone in the world without any bias - if you think you are that person, then I can assure you that you most likely fall into the category of having more bias than average. There are two types of bias: conscious and unconscious.
Conscious being the obvious one - the biases we are aware of. Maybe you naturally don’t like people who put pineapple on their pizza. You could be hanging out with a friend of 10 years and they whip out a pizza from their pocket and it has pineapple on it. Now every time you see them, you know, and you can’t stop thinking about it. Maybe you stop talking to them.
But conscious biases are easy to account for, because you literally know about them. The other type of bias is unconscious bias:
Unconscious biases are social stereotypes about certain groups of people that individuals form outside their own conscious awareness. Everyone holds unconscious beliefs about various social and identity groups, and these biases stem from one’s tendency to organize social worlds by categorizing.
Unconscious bias is far more prevalent than conscious prejudice and often incompatible with one’s conscious values. Certain scenarios can activate unconscious attitudes and beliefs. For example, biases may be more prevalent when multi-tasking or working under time pressure.
Basically, you don’t know what your unconscious bias is. I think people really overlook that point. I’ve spoken to people about unconscious bias, and generally the conversation consisted of the individuals listing things they would have an unconscious bias for, which is literally the opposite from what I am talking about. That’s the thing. You literally cannot pinpoint these things yourself without someone telling you - it’s bias that goes beyond your literal consciousness. What’s even weirder is that it might even go against your own values! How would you know what your unconscious biases are?
Ok, so we have opinions even when we don’t mean to. So what?
Your unconscious bias kicks in while you go about your everyday life and sometimes it can start affecting people. For example, the original iOS developers of the YouTube app were all right-handed and didn’t account for left-handed people, which led to videos being uploaded upside down. The team was made up of right-handed people, and so they had an unconscious bias towards right-handed users. Similarly, the original Apple Health app was developed primarily by male developers, and they completely overlooked the need for a menstrual cycle tracker. In fact, when it comes to women’s needs, they are (unfortunately, as history shows) generally overlooked, as most developers are men.
Bias already exists in the tools we use today
Conscious and unconscious bias, especially in AI, goes beyond some products being designed poorly or features being left out. At the moment, when training a model, data scientists will have to pull from data that is currently available or generate it somehow.
Historically, most of the data generated was most likely generated by white males, as they had the most access to resources. This is due to numerous reasons that go beyond what I can discuss here, but there is a clear lack (in comparison) of data for individuals that don’t fall into that category.
If you don’t take that into consideration, it can lead to some unintended consequences. For example, Apple came under fire for releasing a ‘sexist’ credit card service. For whatever reason, women were getting rejected or being offered lower credit limits compared to men. I don’t believe anyone was sitting in Goldman Sachs, the bank behind the card, creating an algorithm that disadvantaged women on purpose. What I think is more likely is that the data they used to train the algorithm didn’t take into account the historical socioeconomic background of that data. Historically, men were the ones who worked and were more likely to have credit cards and financial products to their name. In a married couple, it’s more likely for the husband to take out the loans and credit cards. I’m sure there are hundreds of other reasons that this error could have happened. Put frankly, there was probably more information out there about male borrowing and spending habits than female ones, and therefore the algorithm was less accurate for women.
Moving away from gender bias, another example is the software that does post processing for photography on mobile phones. As someone who taught himself photography, I learnt from various online articles and YouTube tutorials. Something that I didn’t take into account was the fact that in those tutorials, the human subject was always white. All the editing techniques I learnt were geared towards a white model. When I started taking photos of my brown family members and edited the photos using the techniques I had learnt, none of my family members liked the photos. Everything looked a bit off. It took me a while to learn what was wrong and I had to re-learn and experiment myself to get the look I felt was appropriate for my family’s skin complexion.
This made me aware of when a photo of a non-white subject was being processed as though the subject was white. This was really prevalent in mobile phones (in my opinion, Apple, Snapchat and Instagram were the worst at this, but they were definitely not alone. The recent OnePlus 9 has managed to get the colour balance post processing right). The thing is, those photos don’t necessarily look bad, and someone who isn’t aware wouldn’t know that this was happening. To them, photos have always looked like this. I had a tough time convincing people otherwise.
You can imagine my delight when Snapchat came out and admitted that their camera wasn’t inclusive enough (and Google did the same!). Photography and video were always geared towards white skin in the past, as explained in this video, and as a result many people are unconsciously training cameras to not deal with darker skin so well. Again, I don’t think anyone in any of those companies today is doing this on purpose. But that’s the problem with unconscious bias.
Data is predominantly generated by the richest in the world
If you are reading this, then you are one of the richest individuals in the world. Growing up in a developed country where there is a solid infrastructure for things like energy, transport, food, etc. is a privilege that we should never take for granted. With these things in place, those of us in developed countries have an easier time accessing services, and in more recent times, digital services. Of a population of nearly 8 billion, around 3.7 billion people do not have access to the internet.
In my previous article about AI, I explain how training an AI model works on a very high level. But essentially, what you need to do is gather a large amount of data and then feed it into the model, telling the model what to expect as the output. From that, the model will be able to look at similar data inputs in the future and predict an answer.
For example, if you wanted to make a model to predict house prices, you could feed into the model all the characteristics of houses sold over the last 50 years (like how many rooms each has, whether it has a garden, the safety of the neighbourhood, etc.) along with the price each one sold for. From that, the model will eventually be able to predict house prices given a set of characteristics.
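To make the house price idea a bit more concrete, here is a minimal sketch in Python. It is not how any real pricing tool works. It assumes a tiny, made-up list of past sales (the `PAST_SALES` table, its three characteristics and the `predict_price` helper are all invented for illustration) and uses one of the simplest possible approaches: find the past sales most similar to the new house and average their prices.

```python
from math import dist

# Hypothetical past sales, invented for illustration only.
# Characteristics: (number of rooms, has a garden 0/1, neighbourhood safety 0-10)
PAST_SALES = [
    ((2, 0, 4), 150_000),
    ((3, 1, 6), 240_000),
    ((3, 0, 5), 200_000),
    ((4, 1, 8), 340_000),
    ((5, 1, 9), 450_000),
]


def predict_price(features, sales=PAST_SALES, k=2):
    """Average the prices of the k past sales most similar to `features`."""
    # Similarity here is plain straight-line distance between characteristics.
    nearest = sorted(sales, key=lambda sale: dist(sale[0], features))[:k]
    return sum(price for _, price in nearest) / k


# A house that resembles the ones the model has already seen.
familiar = predict_price((4, 1, 7))

# A house unlike anything in the data: the model still answers,
# but it can only echo the closest sales it happens to know about.
unfamiliar = predict_price((1, 0, 1))

print(familiar, unfamiliar)
```

The second prediction is the interesting one. The model has never seen a house like it, yet it still returns a confident-looking number, built entirely from whatever data it was given.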
For many of the tools we use today, the only data available, or the bulk of the data available, is from those in developed countries. With easier access to services and better documented records, it’s easier to generate, collect, clean and archive the terabytes of data needed to train AI models.
The World Health Organisation (WHO) recently came out and warned against the disparity between the data collected from wealthy nations and developing nations. Collecting data from developed nations isn’t a bad thing in itself, but it can be a problem if you are using the resulting algorithm to help or engage with people in lower income countries.
Quoting the WHO:
Data sets used to train AI models are biased, as many exclude girls and women, ethnic minorities, elderly people, rural communities and disadvantaged groups. In general AI is biased towards the majority data set (the populations for which there are most data), so that in unequal societies, AI may be biased towards the majority and place a minority population at a disadvantage.
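The WHO's point about the majority data set can be shown with a toy example. The sketch below is only an illustration under invented assumptions: the two groups, the scores and the cut-off rule are all made up, and the "model" is nothing more than a single threshold picked to make the fewest mistakes overall. Because 90 of the 100 records come from one group, the threshold ends up fitting that group perfectly while getting the smaller group wrong more often.

```python
# Invented records: (group, score, true outcome).
# For the majority group the true outcome is positive from a score of 5,
# for the minority group it is positive from a score of 3.
majority = [("majority", s, s >= 5) for s in range(10)] * 9   # 90 records
minority = [("minority", s, s >= 3) for s in range(10)]       # 10 records
records = majority + minority


def errors(threshold, data):
    """Count records where 'score >= threshold' disagrees with the true outcome."""
    return sum((score >= threshold) != outcome for _, score, outcome in data)


def fit_threshold(data):
    """Pick the single cut-off that makes the fewest mistakes across all the data."""
    return min(range(11), key=lambda t: errors(t, data))


def accuracy(threshold, data):
    return 1 - errors(threshold, data) / len(data)


threshold = fit_threshold(records)
overall = accuracy(threshold, records)
majority_accuracy = accuracy(threshold, majority)
minority_accuracy = accuracy(threshold, minority)

print(threshold, overall, majority_accuracy, minority_accuracy)
```

The overall accuracy looks excellent, which is exactly why this kind of problem is easy to miss unless you check each group separately.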
But connecting to the internet is getting cheaper, and the new age of communication is reaching the less rich - so data is being collected.
The next billion users and the risk of Data Colonialism
Google has called the new internet users ‘the next billion users’. It’s a Google philosophy to try to build their services so that they can run on less powerful technology with weaker infrastructure. This will give a more consistent user experience to poorer nations. Examples include making Google Assistant run locally, or reducing the amount of RAM required to run Android. In these countries, luxury products like iPhones are less common, and cheaper Android phones are very popular. This works great for Google as they can then collect data to improve their services - filling this gap between richer and poorer nations - and more importantly giving Google more data to sell to advertisers.
Even among the new internet users, there is still a divide forming - in lower income countries around 327 million fewer women than men have access to mobile internet. This will naturally lead to fewer data points being collected for women, a point that should be considered going forward when training AI models to be used in those nations. Either way, this step will hopefully lead to data being collected from a more diverse set of individuals, and therefore to more inclusive tools for everyone to use.
But, as we have seen with data collection and Cambridge Analytica, collecting data can be done by irresponsible people and can be a huge risk. With the huge inequality in control between the big tech companies collecting data and the individuals who use their services in developing nations, we could fall into the trap of ‘Data Colonialism’.
What even is Data Colonialism?
Data Colonialism is the collection and use of data for commercial or non-commercial purposes without due respect for consent, privacy and autonomy.
It’s something we are struggling to fight against in the developed nations. These large companies with a monopoly on our services, hardware and data are collecting and monitoring everything we do with or without our consent and making huge bank off it. In return we get nothing, or worse, our data breached. We are in the new wave of pushing back and trying to get our consent back and to force these corporations to actually respect our boundaries. Up until now, not a huge amount has happened, but with the current government in place in the US, alongside the various anti-trust lawsuits in the US and Europe in motion, I hope in the near future some progress is made.
However, in many places where people are connecting to the internet for the first time, this power dynamic is even more unbalanced. Many of these individuals in developing nations might not be aware of how their data is being collected and used - and even if they are, there isn’t much they can do. The power imbalance is too large for them to even have a say in this conversation. Europe only recently implemented the GDPR and California introduced the California Consumer Privacy Act, both of which are frameworks for addressing this very issue. It took multiple scandals and a large amount of pushback for these governing bodies to actually put something in place. Even then, it’s still difficult to enforce due to how quickly these tech firms have taken over every aspect of daily life here. In developing nations, these frameworks will not be in place and I’m sure tech firms will try their absolute best to embed themselves into people’s daily lives, making it harder for people to push back even if they wanted to. These individuals will have no guarantee that their data is safe and not being abused or misused, by tech firms or the government.
Beyond the power imbalance, there are all the other risks of data collection that I hope are being considered for these next billion users. For example, it’s not new for surplus data to be collected and then used for something beyond the original intention. A recent example is Singapore assuring people that any contact tracing information collected would only be used for contact tracing, only to later admit that the data could be passed on to the police for investigations.
Then there are the usual issues: appropriate consent, privacy, adequate cyber security and only using the data you collected for its intended use. All of this comes before any AI is trained! This data can then be used to train models that increase the effectiveness and profit of these large companies, while the individuals the data was harvested from are left without any of the remuneration.
Data collection doesn’t have to be bad
This may sound like I am against data being collected or the next billion coming online - I’m not! I love the use of AI and if carried out responsibly, it’s such a powerful tool. Looking back, there are huge biases in these models due to the inequalities we had, and have, as a society. These inequalities are finally at the forefront of some conversations and some people are trying to address these concerns.
With the next billion coming online for the first time, many of these infrastructures and frameworks are not in place. We are also significantly more knowledgeable about the positives and negatives of data collection and AI training. We have the chance to create a more inclusive environment and model where these new internet users are educated about how their data is collected, and where their data is not abused by bad actors.
It seems unlikely that user privacy concerns will be taken into account when collecting data and training AI, as doing so would ultimately affect these large corporations’ bottom line.
But I sure as hell won’t stop talking about it.
If you have a better idea than I do, if I’ve missed out anything or you think I am talking absolute rubbish, feel free to reach out either by commenting on the post, or by emailing me on tanvirtalks@substack.com
If you enjoyed this post, subscribe to Tanvir Talks, where I publish a newsletter once a month breaking down the big questions asked in tech into digestible chunks for you to consume, the average consumer. I also have a podcast where I do the same thing!