Former US chief data scientist D J Patil on data science’s relevance, Aadhaar, and the importance of ethics


Story Highlights

  • "Just because you can with data doesn't mean you should," says D J Patil. Data sets with GST and biometric information are extraordinarily sensitive
  • Aadhaar and the Goods and Services Tax Network are something that India should be “proud of having rolled out"
  • The best way to have the discourse on the ethics of data science is for policymakers to make sure that technologists work hand in hand with them, and to have "real people" at the table

As India rolls out technology-based systems like Aadhaar, policymakers and technologists must also focus on the ethics of data science and bring real people into the discussion, one of the top data scientists in the world said.

D J Patil, a former chief data scientist at the White House, said in an interview with FactorDaily, “Just because you can with data doesn’t mean you should. So, these data sets, with GST, biometric information, these are extraordinarily sensitive data sets.”

Patil, who believes that programs like Aadhaar and the Goods and Services Tax Network (GSTN) are something that India should be “proud of having rolled out,” said that the discourse on the ethics of data science is an important one to have.


“I have not spent enough time understanding all the nuances, or the different complexities of different parts of India with regards to the system. So, it wouldn’t be appropriate for me to say this is right or wrong,” said Patil, who is on a three-week trip to India. “But everybody should be asking, do we want the data to be used for this? Or this? That’s a question for society. Data scientists should be hoping to facilitate that conversation,” he said.

The best way to have this discourse is to have policymakers make sure that technologists are working hand in hand with them and also have “real people” at the table. “Not some lobby group that represents people. The actual people. And they have to all be sitting at the table, talking about the hardest parts of the problem,” said Patil, who along with top data scientist Jeff Hammerbacher coined the term “data scientist”. During the Obama administration, Patil became the first ever chief data scientist to be appointed by the US government.

Patil covered a range of topics — from using data science in medicine to justice systems and starting a career in data science — in an hour-long conversation with FactorDaily. We’re running the interview in two parts. In this first part, the edited excerpts deal with the ethics of data science and some burning questions.

You co-authored a post in Harvard Business Review in 2012 that called data science The Sexiest Job of the 21st Century. Does that still hold true? How much of a data scientist’s job has been automated, or is at risk of automation?

That article was written with my co-author Tom Davenport, and he has been an analytics person working in the field for decades, much longer than I have. We didn’t actually come up with the title for the article; Harvard Business Review did. And so I would generally say that it shows that Harvard has learned something from the Kardashians about how to get people to click on something.


It really has proven that you can take extremely unsexy things, and if you look at them, and you see the value and the power that they can provide, they are extraordinarily interesting and sexy areas, as they provide disproportionate value.

So that is fundamentally why data is viewed as such a sexy area. Where we think we’re at is the beginning of seeing what people can do with data, whether we call it data science, statistics, economics, computer science or machine learning and AI. These things are just starting, but they’ve also been going on a long time.

Speaking of data-driven companies such as Facebook and Google, has India lost the opportunity in this space to create consumer-focused companies built on data moats?

No, I think the opportunity is ahead. You know, Flipkart and Ola are two great examples in just that space, but there’s so much more. India has an incredible opportunity to really make the notion of smart cities real. The investment that’s being put in place, the raw passion of the entire population is extraordinary. You know, people talk about how India doesn’t have the technical capabilities, but India just rolled out the ID cards (UIDAI). That’s like a billion people on these cards, and all the infrastructure, and all the different things that are required to make that work.

Also see: How data is making Delhivery India’s first e-logistics unicorn

What are the opportunities in India to harness big data? In a previous interview, you had mentioned that the GST records are also a huge treasure trove of information…

One thing that’s very important to talk about is that a key aspect intrinsic to data science is the need for ethics. One of the things we called for when I was in office is that every data training program — data science, economics, statistics, computer science, machine learning, whatever you want to call these things — has to have ethics and security as a core component of the curriculum. Just because you can with data doesn’t mean you should. These data sets, with GST, biometric information, these are extraordinarily sensitive data sets. That doesn’t mean we should be using them in all these different manners. There’s the use of the data, and there’s how the data is collected, and making sure that there’s an ability to document people. That’s great. But everybody should be asking, do we want the data to be used for this? Or this? That’s a question for society. Data scientists should be hoping to facilitate that conversation.

Also see: India’s top tech architect talks about the tech behind GST, data empowerment

With all the data dumps going around…

That’s what we refer to as a breach: someone coming in and stealing your data. That’s critical. But what’s going to happen is we’re going to have a different type of attack, where people come in and corrupt data, or manipulate data. We’re seeing versions of that right now with ransomware, where people come in, encrypt your data, and hold it hostage, hold it for ransom. Only if you pay them do they unlock your data. We need to get ready for that event. Many of these bad guys are going to use artificial intelligence and machine learning to figure out how to attack systems just as much. And we need to get ready for that type of fight. And the way to defend against that is machinery that is itself very data-driven.

How does one get around to talking about the ethics of user data, and to what end?

First, it starts with a common language, so that we can have a conversation about ethics. Right now, there isn’t that common language. That’s why it’s so critical that every person who trains in any kind of data science program must have some ethics curriculum. It can’t be some elective thing that you take later, or on the side. It needs to be built into everything. If you’re trying to do a project where you’re collecting data about, say, people’s faces, and you’re doing facial recognition, you have to ask yourself right at that moment: is it acceptable to be collecting and storing a child’s information? In some cases, the answer may be yes. In many other cases, the answer is probably no. So, how do you make that nuance? You have to have a vocabulary to do that. The next part of that also comes into making sure that that data is secure and safe, and how do you store it, where do you put it. And that’s why security has become part and parcel of this conversation. The other thing is that it’s not just up to data scientists to decide what is ethical. It’s up to the community. It’s the people who have contributed the data. The number one way to do that is to make sure data scientists don’t just operate in a vacuum. You have to have the actual people having a seat at the table as you’re designing the system, designing the algorithms, and thinking about this.


Right now, everyone’s focused purely on the access of data, and do you have my data, can I have my data… We’re not talking about the algorithms that sit on top of this. Well, how do we ask what happens if the algorithm biases me versus another person? What does that world look like? You know, that’s where this starts out. By the way, this is happening now. If you apply for a job and you take some type of personality test, you don’t know what to make of it, or how to think about it.

Also read: Biased bots: Artificial Intelligence will mirror human prejudices

Let me give you a very specific set of examples. Right now, you take a personality test. You have no idea if that personality test is accurate, how the system thinks about you, or whether it’s being biased in some way. In the United States, if you get put in jail, or somebody thinks you’ve done a crime, you have to post what we call a bond. There’s a calculation of how much you should pay. And people have actually built these calculators for judges to help assess you. Turns out some of these calculators were using race, and other types of features. That’s not ok! How many other places is this happening? If a self-driving car is being trained only around populations where there is one type of people, say white people, is it going to see a person in a saree? Is it going to see a black person? Is it going to see a person in a wheelchair, or with crutches? We need to start asking those questions of what’s in the data sets. That is coming fast. That world is here.
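The coverage gaps Patil describes can be surfaced with a very simple audit of a training set’s labels before any model is trained. The sketch below is purely illustrative: the `pedestrian_type` attribute, the sample data, and the 5% threshold are all hypothetical, not drawn from any real self-driving dataset.

```python
from collections import Counter

def audit_representation(samples, attribute, threshold=0.05):
    """Return attribute values that make up less than `threshold`
    of the samples -- a crude flag for under-represented groups."""
    counts = Counter(s[attribute] for s in samples)
    total = sum(counts.values())
    return {value: count / total
            for value, count in counts.items()
            if count / total < threshold}

# Hypothetical training-set labels: mostly one kind of pedestrian
data = ([{"pedestrian_type": "adult"}] * 95
        + [{"pedestrian_type": "wheelchair"}] * 2
        + [{"pedestrian_type": "child"}] * 3)

flagged = audit_representation(data, "pedestrian_type")
print(flagged)  # wheelchair and child users fall below the 5% threshold
```

A check like this doesn’t make a dataset ethical by itself, but it forces the question Patil raises: who is, and who is not, in the data.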

The way I try to explain it to people is: in the fabric of society we live in, we see cracks. Journalists tell stories about them. Activists help rally people around to say this is a problem. We design policy. We find solutions, and we fix it. What happens when the fabric of society is digital? If you show up to get medical care, or to get a service like a ration or something, and the computer says no. What happens if it was just one database being slow in updating the other database? What then? How do you have recourse? The way we have recourse right now is somebody is there that’s a human, and they’re able to say, ‘oh clearly this is a mistake. I’ll fix it.’ What if there’s no human there, and it’s just a machine? What if it’s an algorithm, and somebody says, well sorry, the machine says this? What’s your recourse? How do journalists tell stories about it? How do activists rally people to the cause? We just take for granted what we see on the screen. That’s an exceptionally dangerous world. It’s dangerous because it is susceptible to attack, and it could be biased from the start. That’s the world we have to start getting ready for. That’s why ethics and all of this is so critical.

Also see: Computing pioneer Alan Kay on AI, Apple and future

There are arguments for and against Aadhaar. What is your take?

So, the very first thing that I need to be upfront about is that I have not spent enough time understanding all the nuances, or the different complexities of different parts of India with regards to the system. So it wouldn’t be appropriate for me to say this is right or wrong. But here’s the part that I can say very concretely. Most big IT projects and technical initiatives of this nature have typically failed. Yet this one has gone out. It’s rolled out! So, everyone should first be proud of the fact that such a big thing was able to be accomplished.

Now, the next part, I think, is a question of how it gets utilised, and what are those questions? That’s what I would encourage people to do, because it’s a public discourse that has to happen. And the best way to do that is for policymakers to make sure that technologists are working hand-in-hand with the policy people, and that real people have to be at the table too. Not some lobby group that represents people. The actual people. And they have to all be sitting at the table, talking about the hardest parts of the problem. That’s the only way you get to right answers here.


The part that is going to be critical, and where this often goes wrong from a policy perspective, is that policy people don’t have technologists, and so they say, ooh! That’s a great system, let’s use it for this. But the system was never designed for that. And I don’t know if this is true or not, but it’s very likely that people are asking these systems to be used for things they weren’t actually designed for. Then we blame the technology, rather than what we’re trying to do with it. But if the general measure is: can we actually have an accurate estimate of how many people are in the country, and how do we actually provide better services for them? That sounds like a very lofty goal to me.

Your thoughts on the impact of big data, and in particular, Cambridge Analytica, on the pro-Brexit campaign, and Donald Trump’s presidential campaign?

I don’t have a comment on Cambridge Analytica, because I’ve spent no time on it. But I find it generally hard to believe that any analysis based just on social media, or psychometrics, or any of these things actually has the value that has been claimed. It’s so hard just to do the basics of analysis. It strikes me as pretty far-fetched.

But the other part we should talk about is: Everyone should be extremely clear that the United States democratic system was attacked in our last election. India is about to go through an election, and it’s wise to recognise the type of attack that we saw against the United States. A similar attack was done against France. That attack happened in two ways. One was very clearly an attack trying to get information out of people’s e-mails that was private, and stealing information. The second was using propaganda and abusing systems to get out messages that were misleading, and designed to elicit fear from populations. That is what people refer to as fake news. That can happen here in India just as it happened in the United States.

We are now going through a tough conversation about how to make sure that doesn’t happen again. I would encourage everyone here in India to really start asking how we are going to ensure that the world’s largest democracy is not impacted or abused in any similar way.

Video and lead image by Rajesh Subramanian