This Bengaluru startup uses machine learning to burrow through data to prevent hacks

It’s a busy Saturday afternoon at the Amazon AI conclave, where the country’s top AI practitioners are gathered, and I’m meeting a hacker. At first, the demo seemed ho-hum, but Rahul Sasi, CTO, CloudSek, an artificial intelligence-based risk management startup, is confident that something will give.

I’d given him my phone number and a personal email ID, which he entered on a web dashboard. The search results came up empty.

I then shared an older personal email ID with him, which I had used more freely. The dashboard connected it to four breaches: on Tumblr, Adobe, Dropbox, and LinkedIn. He copied a hashed value from the LinkedIn row on the dashboard, entered it on hashes.org, and read out an old and familiar password of mine.
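Why can a hashed value be turned back into a password? Lookup services keep large tables of hashes computed in advance from known passwords. The sketch below assumes unsalted SHA-1, which is how passwords in the 2012 LinkedIn breach were stored; the wordlist and function names are hypothetical and only illustrate the idea.

```python
import hashlib


def sha1_hex(password: str) -> str:
    """Return the unsalted SHA-1 hex digest of a password."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def build_lookup(wordlist):
    """Precompute hash -> password for every candidate in the wordlist."""
    return {sha1_hex(word): word for word in wordlist}


def lookup_hash(leaked_hash: str, table: dict):
    """Return the matching password, or None if the hash is not in the table."""
    return table.get(leaked_hash.lower())


# Hypothetical wordlist; real lookup tables hold hundreds of millions of entries.
table = build_lookup(["password", "123456", "letmein", "qwerty"])

# A common password is recovered instantly; a long unique one is not in the table.
print(lookup_hash(sha1_hex("letmein"), table))
print(lookup_hash(sha1_hex("a-long-unique-passphrase"), table))
```

Because the hash is unsalted, the same password always produces the same digest, so one precomputed table works against every leaked account.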

“sw**** was one of your passwords, and you probably use it in a lot of sites, a combination of this,” Sasi says, staying poker-faced. (Editor’s note: The asterisks have been used to mask the password.) “We’re basically collecting every incident which is happening live, correlating, and sending into our system.”

Having your password read out to you by a total stranger can be a great eye-opener. Password reuse (where a user has the same password on multiple sites) is quite common and can be a huge risk to corporates. Take the case of the Zomato hack this summer. A developer had a personal account with a third-party hosting company, which was hacked a year earlier. A hacker on the dark web found that the same password was being used in Zomato’s production environment as well.

Sasi showed me some screenshots of dark web listings his company has found. “Here’s a guy selling a method to hack Flipkart wallet. A guy is selling Spotify and Hulu accounts,” he says, pointing to his screen. “What we do is automatically collect all this information and pass it through our machine learning system. The ML system can read and understand a conversation or listing, and collect details like urls, profiles, details of the data being breached, and the price it is selling for.”
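To give a feel for the kind of structured details pulled out of a listing, here is a minimal sketch that uses plain regular expressions as a stand-in. It is not CloudSek’s ML system, which the article does not describe in detail; the patterns, the `Listing` type and the sample text are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns: URLs, and prices written as "$25" or "USD 25".
URL_RE = re.compile(r"https?://[^\s,;]+")
PRICE_RE = re.compile(r"(?:\$|USD\s?)(\d+(?:\.\d{1,2})?)")


@dataclass
class Listing:
    urls: list
    prices: list


def extract(text: str) -> Listing:
    """Pull URLs and asking prices out of a free-text listing."""
    urls = URL_RE.findall(text)
    prices = [float(p) for p in PRICE_RE.findall(text)]
    return Listing(urls=urls, prices=prices)


# Made-up listing text for illustration only.
sample = "Selling 500 streaming accounts, $25 per batch. Proof: http://example.onion/proof"
print(extract(sample))
```

Fixed patterns like these break as soon as sellers change their wording, which is the gap a trained language model is meant to close.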

Founded in 2015, CloudSek is headquartered in Singapore with operations in Bengaluru. Its SaaS offering monitors threats outside the corporate network from an attacker’s perspective, tracking places on the internet where search engines typically don’t go: social networks, the dark web, the deep web, and conversations on underground forums, among others.

CloudSek’s two product offerings, X-Vigil and Cloudmon, and the areas they cover.

The email demo reminded me a bit of Troy Hunt’s https://haveibeenpwned.com. And, to be sure, there are several other companies using AI in cybersecurity. CloudSek is among the few from this part of the world in its space. Two Indian companies – Ineffu Labs and Authbase – on this CB Insights list operate in related areas. Moreover, CloudSek seems to be finding traction in the market; five of its clients are unicorns. (More on this later.)

Finding Live Leaks

Sasi then scanned through his browser search history for a breach he’d discovered a few days earlier. He soon pulled up a text file on GitHub that likely belonged to a former IBM employee, going by the owner’s LinkedIn profile.

In the text file, which was uploaded to GitHub, an online code sharing and version control service, on August 5, we found around 50 logins and passwords relating to his family’s bank accounts, his life insurance policy, trading account, income tax filing, and card and CVV details. It also contained logins and passwords to IBM corporate infrastructure. These included IBM Downloads, Jazz credentials, IBM SVN (source code version control repository) credentials, RQM (Rational Quality Manager) and intranet credentials.

Leaking data on GitHub is a recurring form of what Jason Coulls, a Canadian technologist, called a ‘common-sense failure’ when he shared details of a TCS employee who had leaked banking project data belonging to at least 10 companies earlier this year.

FactorDaily reached out to the developer and notified him of the leaky text file on GitHub, which he promptly de-listed a few minutes later. He was not available for further comment.

CloudSek’s dashboard, highlighting its X-Vigil offering

We also reached out to IBM on the scope of the breach and what threats could emerge from these login details being shared on a public profile. “IBM is committed to protecting the privacy and confidentiality of information for its clients, employees and business partners. The document of a former employee that had once been accessible via Github contained no client data,” IBM said in an emailed statement. FactorDaily couldn’t independently ascertain the extent of damage, if any at all, the password information openly available on GitHub may have caused.

I was shown another GitHub profile leaking hundreds of SMSes from a major Indian bank, an international bank that was leaking its source code, an Indian travel startup that had shared its mobile application source code publicly, a Malaysian telecommunication company’s database that had been leaked, and a credit management platform with its user data for sale on the dark web.

Sasi, who has worked at iSight Partners (acquired by FireEye) and Citrix in the past, also showed me a security threat at the Bengaluru office of startup intelligence platform Tracxn, which had set up a biometric system to track employee attendance. It operated with a default password and was accessible on the web. The login access could have let a hacker view, download or modify the information stored, or delete every record from the biometric device. It could also have been used to stop employees from entering the office, Sasi says, as an example of the potential threats that could emerge from this vulnerability. We mailed Tracxn about the default password vulnerability and they have closed the loophole. (Editor’s note: The author worked with Tracxn in the past and worries his attendance information is out there.)

The CloudSek team takes a selfie at their Bengaluru office

Training Big Data, NLP

At a meeting in the CloudSek Bengaluru office earlier this week, I met its 15-member team. We went over its two products: X-Vigil, which provides threat intelligence, and Cloudmon, which tracks network and application security issues related to a client.

Since the company started in 2015, its first product, X-Vigil, has monitored the web, social networks, and the dark web for security risks. Over time, the team sensed a need for a unified and fully automated platform.

“Traditional risk management companies use static threat detection engines and manual processes, which can be more time consuming and expensive, while with machine learning, the output of one security tool can be an input to another, and will yield better results,” Sasi says.

The X-Vigil system has scanned over three billion data points so far and adds a million entries a day, Sasi says, adding that not all this information is contextualised. “Only when we search for a keyword, does it get any context. Otherwise, at this point in time, the data remains unused,” he says.

CloudSek’s proprietary web crawler can go to any part of the web, register, log in, and collect information, says Finny Abraham, product architect at CloudSek. It monitors more than 1,000 sources and some 3,000 blogs of cybersecurity researchers, he adds.

A demo of how CloudSek’s ML/deep learning technology is able to contextualise threats.

Bofin Babu, machine learning lead at CloudSek, gave a breakdown of its NLP (a branch of AI that deals with understanding language) stack. “We’re basically dealing with text data, sourced from our data collection team. Our system needs to understand the data and distinguish them as threats and non-threats. With threats, we need to understand why this is a threat, and how severe the threat is,” he says. Some guy might say “how to hack a website,” which is a query, while “hacked a website” might be a serious threat. “We use a RNN (recurrent neural network), to be able to distinguish between a query or a real threat in a sentence.”

CloudSek’s data classification is content based, with parameters weighted by data source, such as Twitter, forums, Pastebin, or the dark web. “For example, on social media, how seriously a threat can be taken can be measured with the number of upvotes or retweets. Every domain has its own parameters we can leverage,” says Babu. “We use neural network models to distinguish subtle changes in input text, at the same time we use other regional parameters which can tell us about the seriousness of the threat,” he adds.
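The idea of combining a model’s output with source-specific signals can be sketched in a few lines. This is only an illustration under stated assumptions: the source weights, the logarithmic engagement boost and the `model_score` input (a stand-in for a neural network’s threat probability) are hypothetical and not taken from CloudSek.

```python
import math

# Hypothetical weights: how much a threat from each source is trusted by default.
SOURCE_WEIGHT = {"dark_web": 1.0, "forum": 0.8, "pastebin": 0.7, "twitter": 0.5}


def severity(model_score: float, source: str, engagement: int = 0) -> float:
    """Combine a model's threat probability (0..1) with source-specific signals.

    engagement is the number of upvotes or retweets; its effect grows
    logarithmically and is capped at around 1,000 interactions.
    """
    weight = SOURCE_WEIGHT.get(source, 0.5)  # unknown sources get a low default
    boost = min(1.0, math.log10(1 + engagement) / 3)
    return round(min(1.0, model_score * weight * (1 + boost)), 3)


# The same text scores higher when it is widely shared or comes from a riskier source.
print(severity(0.9, "twitter", engagement=0))
print(severity(0.9, "twitter", engagement=999))
print(severity(0.9, "dark_web"))
```

Keeping the weights in a table per source mirrors Babu’s point that every domain has its own parameters; the numbers themselves would have to be tuned on real data.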

Growing custom

CloudSek has over a dozen customers, five of which are unicorns, says Sasi. None of them are Indian. While he wasn’t able to disclose his entire client list due to non-disclosure policies, he named a few such as Go-Jek, Federal Bank, and Bank Bazaar.

In terms of pricing, the ticket size varies based on the size of a company and its IT infrastructure. Sasi says only a handful of companies have an AI/ML-based cybersecurity approach, naming US-based 4iQ and SecurityScorecard as CloudSek’s direct competitors.

Product diagram showing how CloudSek collects data from the web, dark web, and web applications to build a unified risk management product.

“It looks like it’s basically like an OSINT (Open-source intelligence) but using machine learning,” says a cybersecurity professional, who didn’t want to be named. Most black hat and white hat hackers do reconnaissance on a company or entity, and OSINT is kind of the first step, he says. “To do that, there’s a bunch of tools, a lot of free tools, some which are paid, as well. Looks like they’re doing the same thing, but they’re using ML algorithms to make it better. The question is, are you really using machine learning because it sounds cool or because it’s genuinely solving a problem without machine learning?”

“We’ve been collecting and training and improving our models for two years. At this point our systems are quite capable, even if someone comes in and tries to mimic what we’re doing, they won’t be able to do it,” says Babu.

Machine learning provides superior coverage, as opposed to superior analysis, says Daniel Miessler, a security consultant, in an essay last week, in which he makes a case for algorithmic analysis in infosec. “For most companies, however (say the top 90%), they probably have human security analyst ratios that only allow 5-25% coverage of what they wish they were seeing and evaluating. And for the bottom 10% of companies I’d say they’re looking at less than 1% of the data they should be, likely because they don’t have any security analysts at all,” he writes.

In its Q2 2017 report, Cybersecurity Ventures, a market research firm, predicted that global cybersecurity spending will exceed $1 trillion cumulatively from 2017 to 2021. Meanwhile, it expects cybercrime damages to cost the world $6 trillion annually by 2021.

Disclosure: FactorDaily is owned by SourceCode Media, which counts Accel Partners, Blume Ventures and Vijay Shekhar Sharma among its investors. Accel Partners is an early investor in Flipkart. Vijay Shekhar Sharma is the founder of Paytm. None of FactorDaily’s investors have any influence on its reporting about India’s technology and startup ecosystem.