Demystifying computer vision and deep learning with Krishnendu Chaudhury

If you’re a coder looking to add new skills to your talent stack, nothing today promises a payout as high as hands-on knowledge of deep learning. Owing to several breakthroughs in the field this decade, this class of connectionist (brain-inspired) machine learning algorithms is driving billion-dollar bets across a variety of sectors, notably autonomous transportation. Deep learning applications span a range of disciplines, from healthcare and advertising to speech and natural language processing. It’s an area where talent supply is lower than market demand, not just in India but around the world.

To get an overview of this burgeoning field and the multi-disciplinary skills it takes to learn it, we caught up with Krishnendu Chaudhury, CTO and cofounder of Drishti. The Silicon Valley-based startup is using computer vision to digitise human actions on the assembly line (read FactorDaily’s profile of Drishti from late-May). The startup is eager to add deep learning talent at its engineering office in Bengaluru.

Chaudhury is a computer vision expert with decade-long stints at Google and Adobe and several patents to his name. He also headed the image sciences team at Flipkart, where he used deep learning in the ecommerce company’s visual search and recommendation system. Chaudhury, who shuttles between Drishti’s offices in Palo Alto and Bengaluru, even had a one-year teaching stint at the University of Kentucky, which awarded him a PhD in computer science.

Over a couple of hours and as many cups of coffee, we discussed how the field of computer vision has evolved over the past decades, what sparked the deep learning wave, and machine learning concepts such as RNNs, CNNs, GANs. Krish also weighed in on AI challenges and predictions for the future, how it will impact the job market, and why he feels bullish about Bengaluru.

The love of geometry

It takes a certain bent of mind to be a computer vision expert. Chaudhury, now 53, has been a practitioner at it for many decades, long before the AI boom period this decade. “I would say that the love for geometry is what brought me here. The love of my life has always been mathematics and geometry. I started liking computer vision largely because of geometry,” says Chaudhury, adding he still loves to have a go at math, algorithmic and geometry puzzles in his free time.

Prior to his PhD, he attended Jadavpur University for his Bachelor’s degree in electronics and telecommunication after studying at IIT Kharagpur for a while. Parental pressures might have just as easily pushed a younger Chaudhury into a career in medicine, had he acquiesced to his father’s demands. “My dad (who passed away last year) was a professor at Calcutta Medical College. He felt that family should stay together and wanted me to become a doctor, so he brought me back from IIT Kharagpur and enrolled me at Calcutta Medical College. But I didn’t like studying medicine and dropped off. Thereafter, I joined Jadavpur University,” he says. “I actually stayed in IIT Kharagpur for the whole ragging period. In my days, ragging was serious. I think I was the second-highest ranked person to actually leave IIT that year,” he recounts.

He recounted his computer-vision oriented doctoral dissertation paper from 1994, titled ‘Motion Estimation from a sequence of Intensity or Range Images’. “For example, in the military, when you’re aiming your gun at an aerial target, you want to be able to auto-adjust the gun so you hit the target. For this, you want to estimate the directional trajectory of motion and speed of the target from a sequence of images. Let’s say the camera takes about 30 frames per second, you want to estimate the motion parameters of the target.” As this was a NASA project, his work wasn’t used directly for military purposes, but for space research, he says.

He pulls out a paper napkin and draws on it to explain the concept of gradient descent (more on that later), showing that teaching is an important side of his personality. “I mentor young engineers quite a bit. In Google and Flipkart and now in Drishti, a lot of my work is actually taking a set of young bright people and leading them to victory,” he says.

His work life

Familiar with the potential and perils of Photoshopping today? Chaudhury gave shape to one of the earliest versions of image morphing at Adobe, where he worked as a senior computer scientist in the Advanced Technology Group. “I invented one of the earliest versions of image morphing (1999-2000 timeframe, patented around 2003),” he says. “An elementary form of morphing is taking a person’s photo who isn’t smiling and making them smile digitally.”

At Google, he worked on several projects that employed machine learning – notably Google’s newspaper archive search feature, launched in 2008 on Google’s 10th birthday. This feature, still accessible via Google News, lets you read archives of hundreds of newspapers and thousands of issues from the 1800s and early 1900s. One example is this Indian Express edition from June 5, 1947, which talks about Lord Mountbatten’s plan to transfer power from British hands to India.

From Google’s newspaper search archives of Indian Express. This edition is of June 5, 1947.

“History is always distorted as you go forward. For example, what you hear about Hitler or Nazi Germany now may be very different from the contemporary thinking (then). How did people of that day and age think about Hitler? If you search in Google for Hitler’s death, it would be quite interesting to see an article from Berlin Times in 1945. This was the idea behind the product,” Chaudhury says.

There were several technical challenges in this project, starting with digitising the newspapers from microfilm and detecting text in discoloured and scratched newspaper images. Individual articles had to be extracted from newspaper pages by making sense of often inconclusive visual cues, such as the white spaces (known as gutters) separating the articles. “This was a machine learning, image processing, computer vision task,” he says.

Another problem was front page identification. “In the microfilm that stores the old newspaper, dates are not recorded. The pages are contiguous, so we don’t know when, say, 5th July ends and 6th July starts. So, we created a program to identify the first page of the newspaper. The cues for it are the stylised headings of the newspaper, the name of the newspaper and since this problem is so diverse, machine learning was used to solve it,” he says.

A man by the name Punit Soni was the product manager for the Google newspaper archive team. He and Chaudhury would work together at Flipkart later — Soni as the chief product officer.

Better photos

Among other highlights from his time at Google, Chaudhury managed WebP, Google’s own image compression format, which was launched in October 2010. He also worked on auto-rectification of photos at Google Photos, devising a complex mathematical way to restore the parallelism of lines inside an image and make photos more pleasant to look at. He worked, too, on an early version of face recognition-based login for Android, although Google did not launch the product.

“I made a product management decision error in that product,” says Chaudhury with candour rare in technology circles. He tried to make the product too secure and, in the process, it became too heavy. “We tried to make liveness detection – ensuring the face being shown to the camera is a live face and not a still photo – an integral part of the product. For this, we employed gaze detection. A randomly moving dot was shown on the screen and the user was asked to track it with her eyes. The computer analysed if the direction of gaze matched that of the moving dot. I patented that technology too. Ultimately, I got push back that it was harder to use than simply typing in a password,” he says.

At Flipkart, where he led computer vision and deep learning projects, his team built the company’s first visual recommendation engine, which would recommend visually similar products. At its core, the problem revolved around teaching the computer the notion of visual similarity.

Some of the challenges with deducing visual similarity, as described by Chaudhury and co-authors in a research paper.

“Similarity is a very subjective word. Similarity can be conceptual (say, two t-shirts with spooky prints) or detailed (two t-shirts with very similar stripes). Humans can see the similarity between a shirt worn by a human and the same shirt hanging on the wall. But their images would look very different to a computer,” Chaudhury says. “Again, consider an evening dress on a mannequin standing at some arbitrary pose versus the same dress hanging on the wall. To a computer, these images would look very different while a human will say they are similar. Because of such extreme variability in the notion of similarity, a computer has to learn the concept of similarity through deep learning.”

Chaudhury says he has very fond memories of his time at Flipkart, where he got a free hand to pick the best and brightest engineering talent and mentor them on this machine learning challenge. “I think very fondly of those young folks that I worked with, some of the best I ever saw in professional life were at Flipkart. Part of my liking of Bangalore comes from that experience,” he recounts. “Flipkart has very good people, their processes could improve… like data collection etc. But as far as engineer quality goes, they were very good is my feeling.”

Engineers to deep learners

As someone with experience mentoring young engineers, Chaudhury has a good idea of the kind of qualities to look for in a deep learning professional. “It’s not that easy for a rank outsider to enter this world in a serious way. If you are solving a new problem, where do you get training data? Even universities sometimes suffer from lack of training data,” he points out.

Entry-level criteria for a deep learning engineer include great programming skills, specifically knowledge of Python and C++. “The other thing I look for is math skills – in particular, linear algebra. Machine learning is way more heavy math, compared to what many other branches of computer science would need,” Chaudhury says. “With these skills, learning TensorFlow is going to be a breeze. TensorFlow is a very complex beast. Without good linear algebra and geometry fundamentals, you will not get the right intuitions,” he adds. TensorFlow is an open source library, originally developed by the Google Brain team.

Chaudhury is sceptical about most of the MOOCs (short for massive open online courses), which, he says, water down the discourse in order to appeal to a large audience. “They exactly give a set of commands – you can brainlessly go and type them and see results. You have not learnt anything if you have done that,” he says. “It is only when you try to solve a problem in the real world, using your knowledge – that is when you learn.” For people who are just starting out, he recommends Andrew Ng’s course on Coursera and, for serious practitioners, Ng’s lectures at Stanford.

He also warns against the tendencies of engineering outfits in India to apply a thrifty, short-termist mentality when tackling hard AI problems. “There is a tendency to quickly cash out. You can’t create a serious AI application by taking a pre-trained model from somebody else and using it on your problem. Training my own models needs good investment in hardware, which many startups are not willing to do. In this case, they will keep getting mediocre results.”

“Why has nothing close to Google, Facebook, Twitter, Adobe and other deeply innovative companies come out of India?” asks Chaudhury. “There is no dearth of brain here but what is lacking is the mindset that it may take me a little bit longer but I will hang in there and generate deep work. Instead the mindset is – let’s make some quick wins. The same mindset manifests as – I will take somebody else’s model and apply it here, rather than create my own architecture and train my own machine.”

Machine learning as a geometry problem

“In some ways machine learning is n-dimensional geometry on steroids,” says Chaudhury, adding that most machine learning problems revolve around building classifiers. While it’s hard for us to visualise more than three dimensions, one’s intuitions from three-dimensional geometry can be used to imagine n-dimensional geometry, he says.

“Let’s say I want to build a classifier, which says ‘Are you sitting on that side of the table or this side of the table?’. During training, the machine will see a bunch of points on that side and a bunch of points on this side. The machine will learn a separator, which will be like a plane passing vertically through the table. During inferencing, the machine will get an unknown point. It will check which side of the learnt plane (the separator) the point is on, and make its prediction, this side or that side,” he says. “Each instance of the object we are trying to recognise is a point in n-dimensional space. Given a lot of training data, the machine learns a separator between the cluster of points belonging and not belonging to the object. For many machine learning systems, the separator is a hyper-plane.”
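
The table example can be sketched in a few lines of Python. This is our own illustration, not Chaudhury’s code: it uses a perceptron, one simple algorithm that learns a separating hyper-plane, and the points, labels and learning rate are made-up assumptions.

```python
# Toy perceptron: learns a separating line (a hyper-plane in 2-D).
# The data, labels and learning rate below are illustrative assumptions.

def side(w, b, point):
    """Return +1 or -1 depending on which side of the plane w.x + b = 0 the point lies."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return 1 if score >= 0 else -1

def train_perceptron(points, labels, epochs=100, lr=0.1):
    """Training: adjust the plane until every training point is on its correct side."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if side(w, b, x) != y:
                # Nudge the plane towards classifying x correctly.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

# "This side" of the table is labelled -1, "that side" is labelled +1.
points = [(-2.0, 1.0), (-1.5, -0.5), (-1.0, 2.0), (1.0, 0.5), (2.0, -1.0), (1.5, 1.5)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_perceptron(points, labels)

# Inference: an unseen point is classified by checking its side of the learnt plane.
print(side(w, b, (3.0, 0.0)), side(w, b, (-3.0, 0.0)))
```

Real systems work in far more than two dimensions, but the check at inference time is the same: compute which side of the learnt separator the new point falls on.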

One strength of neural networks is that they can learn to separate with non-linear (i.e., curved) surfaces, Chaudhury says. He takes the example of how a facial recognition system would work. “Suppose the classifier I am building is trying to decide ‘Is this a face or not a face?’. How would I do that? I will take a bunch of features.. let’s say colour, that’s one dimension. Is it brownish or white or black? If you’re green, it’s probably not a face. Are there black-ish things (eyes) near the top? Ultimately, every candidate face then becomes a point in this many-dimensional space. The true faces will form a cluster of points relatively close to each other. During inferencing, the machine sees a candidate face, maps it to a point in the feature space, checks the position of the point vis-a-vis the learnt cluster, and makes a face/non-face prediction,” he says.

“One noteworthy thing about neural networks is that they even learn what are the features important to perform the specific classification task at hand,” he explains. “Thus, one does not extract and provide features as inputs. Rather the entire image is provided as input and the machine learns what features will be good to make this classification, how to combine them etc.”

Gradient descent, backpropagation

One of the concepts key to understanding deep learning is gradient descent. Chaudhury helped us understand what it is by drawing a picture of what it looks like – again, on a paper napkin.

“Almost always in machine learning, we’re minimising a function. We have an error function, we are incrementally minimising that. How do we minimise the error function?” he asks. “We don’t know the shape of the function. Let’s imagine a bowl, we’re somewhere on its surface, and we’re trying to move towards the bottom of the bowl, in steps. Thus we constantly move downwards. Remember, however, the bowl is in a high-dimensional space, not in a 3D space. There is a certain class of functions which are very friendly… these are called convex. Convex functions have a single global minimum. If you constantly move downwards, you will eventually hit the global minimum. Non-convex functions typically have local minima. While doing gradient descent, one can get stuck in a local minimum, yielding a non-optimal solution,” he adds.
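
To make the bowl picture concrete, here is a minimal sketch of gradient descent in Python. The error function, starting point, step size and step count are all assumptions chosen for illustration; the function is a convex bowl whose bottom sits at (3, -1).

```python
# Gradient descent on a convex "bowl". The function and all constants are illustrative.

def error(x, y):
    """A convex error function with its single minimum at (3, -1)."""
    return (x - 3.0) ** 2 + 2.0 * (y + 1.0) ** 2

def gradient(x, y):
    """Partial derivatives of error() with respect to x and y."""
    return 2.0 * (x - 3.0), 4.0 * (y + 1.0)

def gradient_descent(x, y, lr=0.1, steps=200):
    """Repeatedly take a small step in the downhill direction (against the gradient)."""
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

x_min, y_min = gradient_descent(10.0, 10.0)
print(x_min, y_min, error(x_min, y_min))
```

Because the bowl is convex, any starting point leads to the same bottom; in a real network the error depends on many thousands of weights rather than two coordinates.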

“Funny thing is, I have been doing this for 20 years. In the old days, we tried many things to get out of local minima. Everybody thought that’s such a bad thing to happen. You know what deep learning does to get out of local minima? Nothing. It just assumes that if you train anything, sometimes you will start here, and get stuck here. But if you started here, for example, and kept going downwards, you would have hit the global minimum. So deep learning says, just train repeatedly, eventually you will get there,” he says.
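
The “start here, get stuck here” idea can be sketched with a one-dimensional non-convex function. Everything below is an assumption made for illustration: the function has a shallow local minimum near x = 1.13 and a deeper global minimum near x = -1.30, and the starting points are fixed so that the run is reproducible (in practice they would be random initialisations).

```python
# Gradient descent from several starting points on a non-convex function.
# The function, step size and starting points are illustrative assumptions.

def f(x):
    return x**4 - 3 * x**2 + x

def df(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def descend(start, lr=0.01, steps=2000):
    """Plain gradient descent; it settles in whichever valley the start belongs to."""
    x = start
    for _ in range(steps):
        x -= lr * df(x)
    return x

starts = [-2.0, -1.0, 0.5, 2.0]
finishes = [descend(s) for s in starts]

# Keep the best result over all runs, i.e. "just train repeatedly".
best_x = min(finishes, key=f)
print(finishes, best_x)
```

A run that begins at 2.0 rolls into the shallow valley and stays there, while runs that begin on the left find the deeper one; keeping the best of several runs sidesteps the problem without any special escape mechanism.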

Chaudhury went on to describe the concept called backpropagation, a technique by which the weights in a neural network are altered to minimise errors.

“The error function effectively measures how well you are doing on the training data. Backpropagation is effectively trying to change the weights so that the error is minimised. It is doing a gradient descent, trying to go to a minimum. For a multi-layered network, a method was found to do this in a structured fashion so that the complexity goes down significantly. In a layered model, each layer does its backpropagation in an iterative chained manner. This idea changed the ball game. You are not working on all weights at the same time; instead you only work on one layer at a time,” he says.
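
As a rough sketch of the layer-by-layer idea, the Python below backpropagates the error through a tiny network with one hidden layer of two tanh units. The network size, initial weights, single training example and learning rate are all assumptions made for illustration.

```python
import math

# Tiny network: scalar input -> 2 tanh hidden units -> scalar output.
# params = (w1, b1, w2, b2); all values used here are illustrative.

def forward(params, x):
    w1, b1, w2, b2 = params
    h = [math.tanh(w * x + b) for w, b in zip(w1, b1)]   # hidden layer
    y = sum(v * hi for v, hi in zip(w2, h)) + b2          # output layer
    return h, y

def loss(params, x, target):
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2

def backprop(params, x, target):
    """Compute gradients one layer at a time, from the output back to the input."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    # Output layer: error signal and the gradients of its weights.
    dy = y - target
    grad_w2 = [dy * hi for hi in h]
    grad_b2 = dy
    # Hidden layer: pass the error signal back through w2 (chain rule).
    dh = [dy * v for v in w2]
    dz = [dhi * (1.0 - hi * hi) for dhi, hi in zip(dh, h)]  # tanh'(z) = 1 - tanh(z)^2
    grad_w1 = [dzi * x for dzi in dz]
    grad_b1 = dz
    return grad_w1, grad_b1, grad_w2, grad_b2

def train(params, x, target, lr=0.1, steps=200):
    """Gradient descent using the gradients supplied by backprop()."""
    for _ in range(steps):
        gw1, gb1, gw2, gb2 = backprop(params, x, target)
        w1, b1, w2, b2 = params
        params = (
            [w - lr * g for w, g in zip(w1, gw1)],
            [b - lr * g for b, g in zip(b1, gb1)],
            [w - lr * g for w, g in zip(w2, gw2)],
            b2 - lr * gb2,
        )
    return params

start = ([0.5, -0.3], [0.1, 0.2], [0.4, -0.6], 0.0)
trained = train(start, x=1.5, target=1.0)
print(loss(start, 1.5, 1.0), loss(trained, 1.5, 1.0))
```

The point to notice is that the hidden layer’s gradients reuse the error signal already computed for the output layer, which is what keeps the cost manageable as layers are added.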

Deep learning lingo, explained

Chaudhury went on to explain some of the concepts that deep learning engineers use a lot – GANs, CNNs, RNNs and LSTMs. “GANs (Generative Adversarial Networks) are machines that, in addition to training to produce a desired result, train an adversarial network, a network that creates more challenging test cases for itself. So it is learning two things at the same time, it is generating the output it wants to generate, as well as a harder test case for itself every time. So basically, the machine is improving by challenging itself. This can reduce the need for training data,” he explains. These are particularly useful when training neural networks, as training data is a bottleneck for neural networks. “A baby can see five chairs and know what a chair is. Why do we need to show 500 chairs to a neural network before it can recognise a chair?” he asks.

“CNN (Convolutional Neural Network) is an image-centric technique. What happens is you feed the image to the network. Now it has been proven that most human minds focus on the edges and corners to recognise things. CNN would essentially create edge and corner detectors at the bottom layer. As you move to the forward layers, the ones closer to the output, you start recognising higher-level concepts like nose, ear, and then finally a face. At the highest level of abstraction, it can say this is X person’s face.”
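
What an edge detector at the bottom layer might compute can be sketched by sliding a small filter over an image. The Python below is an illustration under stated assumptions: the filter is a hand-written Sobel kernel, whereas a CNN learns its filters from data, and the 5x6 “image” is a made-up pattern that is dark on the left and bright on the right.

```python
# Sliding a 3x3 filter over an image, as a convolutional layer does.
# The Sobel kernel is hand-written here; a CNN would learn its own filters.

SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

def convolve(image, kernel):
    """Slide the kernel over the image (no padding) and sum the element-wise products."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    output = []
    for r in range(rows):
        row = []
        for c in range(cols):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        output.append(row)
    return output

# Dark (0) on the left, bright (1) on the right: one vertical edge in the middle.
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
edges = convolve(image, SOBEL_X)
for row in edges:
    print(row)   # each row: [0, 4, 4, 0]
```

The response is zero over the flat regions and large where the brightness changes, which is the kind of signal higher layers combine into corners, noses and eventually faces.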

“RNNs (Recurrent Neural Networks) are good for recognising sequences, where the output is fed back into the input to make a loop. Modern-day RNNs are hard to train. The weights will get close to zero, and once it’s near zero, it’s not going to move very much. We are constantly multiplying, and zero multiplied by zero stays zero, so RNNs will stop learning after a while. A real useful variety of RNN is called LSTM – long short-term memory. These are the ones we need to watch for. We haven’t seen even the beginning of the LSTM revolution. Andrej Karpathy says a CNN is like a mathematical function, whereas an RNN is more like a program – with much more flexibility,” he says. Karpathy is the director of AI at Tesla, and a former research scientist at OpenAI.
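
The “constantly multiplying” problem can be illustrated with a one-unit recurrent network. This sketch is our own, with an assumed recurrent weight of 0.5 and a constant made-up input; it tracks how strongly the final state still depends on the starting state as the sequence gets longer.

```python
import math

# One-unit RNN: h_t = tanh(w * h_{t-1} + x_t). Weight and inputs are illustrative.

def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_t / d h_0| after each time step."""
    h = h0
    grad = 1.0
    history = []
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: each step multiplies the gradient by w * tanh'(.)
        grad *= w * (1.0 - h * h)
        history.append(abs(grad))
    return history

history = gradient_through_time(w=0.5, inputs=[0.1] * 50)
print(history[0], history[9], history[49])
```

Each step multiplies the gradient by a number smaller than one, so after 50 steps almost nothing is left for learning to act on. LSTMs were designed to ease this problem by giving the network a more direct path for carrying information across time steps.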

Deep learning godfathers

As someone with decades of experience in computer vision and image recognition, Chaudhury has an intimate understanding of how the field has evolved. While neural networks have been around for more than 20 years, the era of big data and graphics cards made them feasible, he says.

“People realised around 2008 that neural network training could be greatly speeded up (from months to days or even hours) using GPUs (developed for rendering video games). Both neural networks and gaming require gigantic matrix multiplications and what speeds up the latter also speeds up the former,” he says.

“NVIDIA is crucial to this story, as crucial as Google (and) the big data revolution… and Geoff Hinton,” he says. Hinton, widely acknowledged as the godfather of deep learning, is known for popularising the backpropagation algorithm used to train neural networks. “Geoff Hinton had been doing neural networks for many years … in relative obscurity for over two decades. I believe that 20 years from now, we will talk of him like people talk of Einstein,” he says.

Hinton, whose team won the ImageNet challenge in 2012, has had an immense impact on the computer vision field as well. “Around 2010, basically the whole of computer vision started moving into machine learning. This made some parts of my earlier learning kind of useless. For instance, I put in huge effort to deeply understand Fourier Transform. This was necessary in those days as part of image processing. These are becoming less important now. Except for conception building, it’s not directly needed that much anymore,” he says.

These days, all computer vision problems are deep learning problems, he says. “Little else exists, to be honest. I know a lot of computer vision folks, especially the older ones would hate me for saying this,” he says.

Implications of the deep learning wave

The first wave of computers was used for clerical tasks, things that were totally deterministic and bored human beings, says Chaudhury. He predicts that there will be huge competition between humans and robots in the next 20 years, and in this battle, his current startup, Drishti, is rooting for humans.

The job of an expert in any field is not threatened right now and it will be a long while before this happens, he says. “Deep learning is never going to guarantee 100% accuracy. That’s impossible to do because it is a statistical thing by nature. So you’ll always need a top-level expert who has to vet results.”

However, entry-level people who do the preliminary reporting will see their jobs threatened, he predicts. Creative professionals, such as comedians, abstract painters, developers of mathematical theories or social entrepreneurs, will not have their work taken away by computers anytime soon, he says.

“Deep learning will usher in more creativity into human society. This is nothing short of a revolution and it’s happening right under our eyes. But that does not mean humans will have nothing to do. Take the comedian. We are 100 years away from generating real jokes with technology, if ever. These really creative performers will not be threatened by technology. The entry-level programmer who codes might be threatened by artificial intelligence but the guy who dreams up a new algorithm is not threatened at this point.”

“It is not the lawyer who makes arguments in court whose job is threatened. It is the entry-level person who does the research and fact-finding for the case – that person is threatened,” he says.

As for AI problems that researchers are looking to solve, Chaudhury names a few. “Object detection is largely solved, while action detection is not that well solved. That’s one of the immediate next frontiers. And Drishti is working in this space.”

“Lowering the training data need is one other frontier to conquer,” says Chaudhury. “Another is generalisability. Right now we create one model to recognise a football, and another to recognise a face. And, if you want to recognise human actions, it is a third model. Everything is a special-purpose model, unlike the single machine in our heads.”

“The machine that plays Go is not the same as the machine that plays Chess. There is one exception – the reinforcement-learning based machine. At Google, there is a team called DeepMind. Demis Hassabis is working on machines that can play all games – chess, checkers, Go, all sorts of things. This is a giant leap towards generalisability. But we are still nowhere close to creating a machine that can analyse Shakespeare’s quotes and also play cricket. That is true generalisability. This would be the next 10-20 year frontier.”


Pictures and visuals: Rajesh Subramanian

Disclosure: FactorDaily is owned by SourceCode Media, which counts Accel Partners, Blume Ventures and Vijay Shekhar Sharma among its investors. Accel Partners is an early investor in Flipkart. Vijay Shekhar Sharma is the founder of Paytm. None of FactorDaily’s investors have any influence on its reporting about India’s technology and startup ecosystem.