The ethical data scientist

With today's application of data science for commercial gain, what ethical standards are data scientists within all kinds of companies expected to adhere to and what are the tenets of ethics that should be applied?

Ethics in the area of data, analytics and statistics is far from black and white, and it’s only getting more complicated. Of course, some ethical matters are non-negotiable, such as anonymising data or refusing to use data sets of questionable origin, but take a small step from there and you’re getting into murky waters.

At the most basic level, ethics in the field applies to data sources. Louise Ryan, Professor of Statistics at the University of Technology Sydney, says: “The main thing for a data scientist is understanding that a lot of your data sources, your raw material, is information about individuals. You have to respect that and make sure you don't do anything that could compromise the privacy and confidentiality of an individual.”

In Australia, training in these ethical matters often falls to employers of data science professionals. Ryan says: “A lot of undergraduate degree programs don't include a lot of training in ethics. They tend to leave things a little bit to the workforce.”

At times, Ryan is called on to educate more senior students, such as those working at a Ph.D. level, about ethics. She says: “If you have a student who's working with data that involves information about individual people, you have to be very careful to make sure they understand the ethical issues. You have to make sure that you're not doing anything that could compromise an individual person's privacy. You have to make sure that you are protecting the confidentiality of data.”

While we can all agree it’s of the utmost importance to have ethics baked into the use of data, there’s a danger too much regulation can hold back progress. Bernard Ferguson is an R&D Program Manager at Australian enterprise software company Atlassian. He says Atlassian’s data ethics standpoint comes down to one golden rule: don’t f##k the customer.

“It's one of our five values which are practical with a wink. Don't f##k the customer is not something you're going to forget easily,” says Ferguson. “When it comes to data and ethically keeping and using customer data, it's one of the most important things we do. Our core value ... is critical and using customer data in the right way is hugely important as part of that.”

While this value is purposefully strong, it’s also somewhat loose, which Ferguson says is by design. “What we find is that if you imply rules, people take that autonomy to think about the problem themselves. There is an overlay or magnitude of different policies and different things that you can do and that generally tends to slow you down,” he says. In effect, Atlassian employees know what constitutes compromising customer data, but they are free from layer upon layer of regulations designed to ensure they deliver on the promise.

An organisation where the rules of ethical engagement are more rigid is the CSIRO. Simon Dunstall, Research Director at Data61, which is part of the CSIRO, says legislation, government and departmental policies dictate the ethical requirements of his work. He says: “We've got a very robust ethics framework, which, on purpose I suppose, gets quite restrictive in the things that we can and can’t do. That's not a bad thing, but it does mean it is at the forefront of our thinking and makes us very aware of things like consent.”

While Dunstall refers largely to consent when it comes to the sourcing and usage of data, today’s digitally supercharged world is opening us up to a whole other area where consent is becoming an issue. Marie Wallace, Analytics Strategist and Data Governance Expert at IBM, says: “Analytics is increasingly permeating our lives but individuals aren't aware that it's permeating and that is good because of the simplicity but also bad in the sense that you don't understand what is happening.”

The example she cites is Facebook and the behind-the-scenes algorithm that delivers a curated view of the world to its users. While those in media, marketing and data professions know there is a formula at work here, many others may not. The end result, according to Wallace, is a reliance on an algorithm to make decisions for us, at times without our even being aware of it. “The line moves very very slowly and you don't even know that it moved,” says Wallace.

This raises a whole subset of questions about the future ethical usage of data which is set to impact our lives in ways we cannot yet fully imagine.

“Data scientists need to be thinking of the social science and societal implications of what we're doing because it is an important dimension, particularly in human analysis,” says Wallace. “If you get something wrong when you're predicting sales outcomes, you get it wrong in terms of what products you're going to sell in the next quarter. You get it wrong about a human being, it can really have a substantial effect. If it's a health care decision, if it's whether to hire or fire somebody, that can affect somebody's life. The decisions we make when we're looking at personal data, human data, has significantly more weight than if we're just analysing what I refer to as transactional data.”

With today’s data scientist as likely to be working to solve health care questions as they are applying their skills to the world of marketing, Wallace spends plenty of time pondering these ethical ramifications. While selling products isn’t life and death, there’s a valid argument that coercing someone into making a purchase could have an unwanted impact on their mental health, and such targeting is only going to become more prevalent with the rise of the Internet of Things.

“Imagine when every Coke can is capturing information,” says Wallace. “That's the reality we could see where every single device we interact with, anything that we buy, could be engineered to be showing data and sending data. That can seem really really nice to consumers, but the data it could generate, marketers could know that Maria is drinking a can of Coke at the moment. And perhaps we could really target her for potato chips. Can you imagine how invasive that could be in our lives?”

Wallace believes this “consumerisation of analytics” calls for additional transparency. “I would like to see a level of transparency between the analyser and the analysee,” says Wallace.

Data61’s Dunstall has a similar stance. He believes the loop needs to be closed between those with access to data and the tools to process it, and the people from whom the data is being derived. He says: “There is a bit of a dilemma when that demographic or that person doesn’t have access to the same data and tools to, say, put forward a counter argument. There is a symmetry question that is not debated a lot but which is important. It comes back to the idea of ownership of that data, in a sense. Do people really own the data about themselves or not?”

Data ownership is another ethical question unto itself but, sticking with the theme of the ethical data scientist, Dunstall believes we face an ongoing challenge. He says: “From an ethics standpoint, data scientists need to develop, as a community, more of a view of what's ethical or not. There really is that danger of ‘lies, damned lies, and statistics’. It is just too easy for someone to mash data up and potentially incorrectly, and/or with an agenda, make claims about what's true or desirable, from a position where they're empowered by their skills and what they have, and implicitly or explicitly, disempower others.”

Wallace concludes that the only way forward is doing things out in the open versus under the covers. She says: “Sunlight is the greatest disinfectant.”