How India’s data labellers are powering the global AI race

Kumaramputhur is a tiny village some 45 km northwest of Palakkad in Kerala, home to some 3,500 families and probably not much bigger than an average Bengaluru suburb. It has no primary industry to speak of. Its gender ratio and literacy rate are lower than the state’s nation-beating numbers. Barring some streaks of modernity, nothing about Kumaramputhur seems remarkable.

It’s in this village that Mujeeb Kolasseri, a high-school dropout, commands a team of over 200 employees working on artificial intelligence solutions for clients across America, Europe, Australia and Asia. At 28, Kolasseri is the oldest member of Infolks, a company he founded three years ago.

From a nondescript office on a highway connecting Palakkad and Kozhikode, a majority of the team is engaged in highlighting and labelling images of vehicles, traffic lights, road signs and pedestrians captured by cameras fixed on autonomous vehicles. The tougher aspect of this job is precisely marking the data captured by remote sensors called LIDAR (light detection and ranging), which creates 3D maps for autonomous vehicles to gain awareness of the objects around them.

Infolks’ office building at Kumaramputhur in Kerala

Some 2,000 km away near the banks of the Hooghly river in Metiabruz, on the south-western fringes of Kolkata, some 200 women are labelling images that will be used to train algorithms in autonomous vehicles and augmented reality systems.

“They work on some of our most cutting-edge image-related projects,” says Jai Natarajan, vice president of technology and marketing at iMerit, an India- and US-based data annotation company, which is to say its employees are engaged in labelling and preparing data to train AI algorithms.

Thousands of staffers at iMerit’s other offices in Kolkata, Ranchi, Bhubaneswar, Vizag and Shillong do similar work, labelling millions of data points to help train and power AI algorithms developed by companies across the globe.

With global enterprise giants embracing AI, and the datasets that feed the AI algorithms increasingly becoming proprietary, companies need a higher degree of engagement with data labelling teams in terms of requirements, quality control, feedback and deliverables.

Because of the business process outsourcing boom around the turn of the century, Indians are no strangers to such jargon and demands. Data annotation and labelling, too, is process-driven, requiring precision work and skills that even people with a high-school education can be trained on.

iMerit founder and CEO Radha Basu at the Metiabruz centre

As the first generation of such work, which was mainly crowdsourced, gave way to more advanced requirements, companies such as Infolks, iMerit and Playment have sprung up to cater to global clients, making India an emerging hub for data labelling and annotation work.

“This is an emerging sector… in India and everybody has begun to realise the humongous opportunity it presents,” says Sangeeta Gupta, senior vice president and chief strategy officer at Nasscom, India’s tech industry body. “AI requires properly annotated, classified and anonymised data. For this, whether you like it or not, you will use automation but you will also have to use skilled human workforce, and that is the opportunity it presents for India.”

The global market for data preparation solutions for AI and machine learning is expected to reach $1.2 billion by the end of 2023, up from about $500 million in 2018, according to a report by research firm Cognilytica.
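To put those two figures in perspective, here is a rough sketch of the annual growth rate they imply. It assumes five years of steady, annually compounded growth between 2018 and 2023; that is an assumption made for illustration, and the Cognilytica report may state its own rate. The function name is hypothetical.

```python
def implied_annual_growth(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value, an end value and a span in years."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the article: about $500 million in 2018, $1.2 billion by the end of 2023.
# Assumption: five compounding periods between the two dates.
rate = implied_annual_growth(500e6, 1.2e9, 5)
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption, the market would need to grow by roughly a fifth every year to get from the 2018 figure to the 2023 forecast.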

What is data labelling?

Data labelling and annotation is a process by which datasets (from unstructured sources such as cameras, sensors, emails and social media, among others, as well as from structured sources such as databases) are labelled, marked, coloured or highlighted to mark up differences, similarities or types. This is so that when the data is fed into an algorithm for training an AI system, the algorithm can correctly identify the data and learn from it.

Say you want to train an algorithm to understand road signs using images captured by a camera onboard a vehicle. Data annotators or labellers will go through the dataset of images and mark or highlight road signs using annotation tools and feed this to an AI algorithm to learn from. The next time the algorithm encounters a road sign during a live drive through an area, it should be able to recognise the sign. The more images of road signs the algorithm is trained on, the better its accuracy.
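For readers curious about what a labeller's output might look like, here is a minimal sketch of one annotated image. The field names, file name and pixel coordinates are all hypothetical and are not taken from any tool used by the companies in this story; real annotation formats vary from client to client.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class BoxLabel:
    """One rectangle drawn by a labeller around an object, in pixel coordinates."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        # Width times height of the marked rectangle.
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def annotate(image_file: str, boxes: list) -> dict:
    """Bundle an image name with the boxes marked on it into one training record."""
    return {"image": image_file, "annotations": [asdict(box) for box in boxes]}


# Hypothetical frame from a vehicle camera with two marked objects.
record = annotate(
    "frame_000123.jpg",
    [
        BoxLabel("stop_sign", 410, 120, 470, 180),
        BoxLabel("pedestrian", 80, 200, 130, 340),
    ],
)
print(json.dumps(record, indent=2))
```

Records of this kind, produced in their thousands, are what a training algorithm consumes: each one tells it where in the image an object sits and what that object is called.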

Infolks’ founder and CEO Mujeeb Kolasseri

Driving the surge in AI or machine learning is access to plentiful data from the internet, social media, sensors and other sources. Algorithms today can absorb more data and, hence, be more accurate. As long as the data is good and clean, feeding another million data points to an algorithm will nudge up its accuracy. This has created an unending hunger for well-annotated and labelled data for AI algorithms and applications.

Today, data preparation and engineering tasks account for more than 80% of the time involved in most AI and machine-learning projects, according to the Cognilytica report.

“If you talk about autonomous driving, one hour of video data can lead up to 800 man-hours of work,” says Siddharth Mall, chief executive of Bengaluru- and San Francisco-based Playment, which works mostly in the autonomous vehicles space.

The Infolks journey

Kolasseri, after dropping out of high school, was working in the aluminium fabrication industry but had to leave due to health reasons. At home, he signed up on Amazon’s crowdsourced jobs marketplace called Mechanical Turk (MTurk) and began taking up annotation jobs from companies across the globe.

“I was able to maintain a rating of 99.8 because of the quality I was able to deliver. One of the companies I worked for liked my work and approached me directly and offered me more work,” says Kolasseri, who then established a six-member team to get the job done. “We initially worked from a home and in early 2016, as we began to grow, I decided to register and set up the company.”

The bootstrapped operation was initially built on Rs 25,000 in investments from Kolasseri’s brother and a friend, who helped set up the company and later joined its board. Today, Infolks is a growing team, with most of its employees coming from in and around Kumaramputhur.

“The company’s vision is to reform our village to a global one as well as to provide economic opportunity to the youth of rural areas,” says Kolasseri. “About 90% of our nearly 200 people are between 20 and 25 years.”

Kolasseri interacting with the team at Infolks’ office in Kumaramputhur

While the team works on datasets across areas such as healthcare, robotics and agriculture, about 75% of its work is in the autonomous vehicles space. Clients include German automotive corporation Daimler and other international technology companies that Kolasseri could not disclose, citing agreements signed with them.

For annotation, the company uses tools provided by clients or third-party tools if a client does not have one. “Our R&D team is developing our own annotation tool. It is currently being tested and should be launched in the next few weeks,” says Kolasseri. Infolks is also setting up another office in a tech park in nearby Kozhikode district. Kolasseri hopes this will boost the company’s revenue, as the new location falls under a special economic zone, or a tax enclave, as well as help expand its global client base.

India’s AI back offices

Amazon’s MTurk used to be a popular platform in India for finding data labelling and annotation jobs before it began restricting non-US workers. Although it lifted the restrictions later, MTurk’s popularity among data labellers waned as enterprise clients began placing more emphasis on data security. Also, new crowdsourcing platforms including Spare5, Cloudfactory and Figure Eight, with a greater focus on annotation and labelling, had entered the market.

“I worked on the MTurk platform between 2015 and 2016 before starting the company but today there are other platforms that are coming up for crowdsourced jobs,” says Kolasseri. “But with enterprise clients very concerned about data security, especially given that a lot of the datasets are proprietary, it becomes a bigger challenge for them to trust workers on such platforms.”

Playment, founded by ex-Flipkart employees Mall, Ajinkya Malasane and Akshay Kumar Lal, has taken a slightly different approach to the annotation and labelling industry.

The company has developed a slew of annotation tools for various use cases as well as a crowdsourced platform of labellers and annotators trained on these tools. The company works directly with clients or with IT service companies that have clients with data annotation or labelling requirements.

“To convert raw data into annotated structured data you require front-end annotation tools, a skilled and cost-effective human workforce, and due to the large amount of data being handled you need to have the right middleware to support different workflows and manage the remote workforce,” says Mall.

Playment’s crowdsourced platform has more than 300,000 annotators and labellers. Of them, the company recognises about 25,000 as ‘highly-skilled top players’, who, according to Mall, spend nearly all day on the platform and earn Rs 20,000 to Rs 30,000 a month on average.

Playment, too, gets much of its work from international clients, a list that includes Samsung, Didi Chuxing Technology, Alibaba, Drive.ai and Continental AG. A major chunk of this work is in the autonomous vehicle segment.

iMerit’s strategy is centred on its employees. About 80% of its 2,000-strong workforce come from families with incomes less than $100 (Rs 7,000) a month; about half of them are women. “We have a social mission to create technology employment among underprivileged communities and in territories where there are fewer companies or industry. We operate in cities slightly lesser known for tech and with less technology employment available,” says Natarajan.

The purported altruism makes for good business sense as well. “The people we work with and the places where we work allows us to scale up the data annotation and labelling team in a very cost-effective manner and also deliver high-quality work to our clients,” says Natarajan.

Although iMerit sources a major chunk of its business from the US (clients include Microsoft, eBay and Tripadvisor), about 90% of its data annotation and labelling work is handled out of India.

Automation in annotation

Companies are beginning to develop automated tools for annotation, but with a lot of jobs requiring nuanced and custom annotation or labelling work, it will be some time before automated tools can achieve a high level of accuracy.

Natarajan says that unlike five years ago, when AI was about differentiating a cat from a dog, present-day AI handles more advanced work. “Machine-learning has moved forward, so nobody is asking us to mark for a dog versus cat. Those days are long gone. Today, every company has customised needs and very nuanced requirements, so it is not possible to automate this or automatically just throw the data and get it labelled by an anonymous set of people.”

Jai Natarajan, vice president of technology and marketing at iMerit

The inevitable emergence of automated AI-based annotation tools, he says, is not a threat. “Automated annotation tools are themselves a result of good annotation to have been trained on them. These tools can take you only up to a certain level when you are trying to solve a problem, but to go beyond that you will need your custom annotation,” says Natarajan.

But that may be only until the automated tools can become effective enough to create good datasets. “In the longer scheme of things, we do recognise that we are in the business of making our project obsolete. When our customer succeeds, then our project ends because the AI has picked it up,” says Natarajan. “But what we also find is, it is never 100%, it is always a continuous learning and improvement process going on. Also, customers will move to the next problem and will start work again from zero.”

In other words, Indian data labelling and annotation companies are yet to peak and it may be a long while before the sector goes the BPO way.

Disclosure: FactorDaily is owned by SourceCode Media, which counts Accel Partners, Blume Ventures, Vijay Shekhar Sharma, Jay Vijayan and Girish Mathrubootham among its investors. Accel Partners and Blume Ventures are venture capital firms with investments in several companies. Vijay Shekhar Sharma is the founder of Paytm. Jay Vijayan and Girish Mathrubootham are entrepreneurs and angel investors. None of FactorDaily’s investors has any influence on its reporting about India’s technology and startup ecosystem.