It's increasingly hard to know what to do with data like that from Cit0Day. If that's an unfamiliar name to you, start with Catalin Cimpanu's story on the demise of the service and the subsequent leaking of the data. The hard bit for me is figuring out whether it's pwn-worthy enough to justify loading it into Have I Been Pwned (HIBP) or if it's just more noise that ultimately doesn't help people make informed decisions about their security posture. More on that shortly; let's start with what's in there. We're looking at a zip file named "Cit0day.in_special_for_xss.is.zip" that's 13GB when compressed:
A couple of folders down are two more folders named "Cit0day [_special_for_xss.is]" and "Cit0day Prem [_special_for_xss.is]".
And then this is where it gets interesting: The first folder has 14,669 .rar files in it whilst the second has a further 8,949 .rar files giving a grand total of 23,618 files. This is where the "more than 23,000 hacked databases" headlines come from as this is how many files are in the archive. Because it's relevant to the story and especially relevant to people who find their data in this breach via an HIBP search, I'm going to list the two sets of files in their entirety via the following Gists:
Let's drill deeper now and take a look inside one of these files. I'm going to pick "chordie.com {1.515.111} [HASH+NOHASH] (Arts)_special_for_XSS.IS.rar" simply because it's one of the larger ones. Here are the contents:
Taking that first and largest file from the archive, there are over 1.5M lines made up of email address and MD5 hash pairs. I'm going to highlight one particular row that used a Mailinator address simply because Mailinator accounts are public email addresses where there is no expectation whatsoever of privacy. Here it is:
traw@mailinator.com:bb796fbe5b644a2a88e3c75207ca4b54
When looking at the "Results.txt" file, that email address appears with a cracked password:
traw@mailinator.com:janid
The "NotFound.txt" file consists of email address and MD5 hash pairs and for each hash I randomly Googled, no plain text result was found, so these appear to be hashes that weren't cracked. The "Rejected.txt" file contained malformed email addresses and "Result(HEX).txt" had a small number of email address and password hex pairs. This same pattern appeared over and over again across the other archives and it gives us a pretty good idea of what the data was intended for: credential stuffing.
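To make the relationship between the hash file and "Results.txt" concrete, here's a minimal sketch of how a cracked pair could be checked against its hash pair. It assumes the hashes are plain, unsalted MD5 of the UTF-8 password, which is what the file naming suggests but isn't something confirmed here. The function names `md5_hex` and `pair_matches` are hypothetical and exist only for illustration.

```python
import hashlib


def md5_hex(password: str) -> str:
    """Return the lowercase hex MD5 digest of a password (assumes UTF-8, no salt)."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def pair_matches(hash_line: str, cracked_line: str) -> bool:
    """Check that an 'email:md5' line and an 'email:password' line agree.

    Both lines are split on the FIRST colon only, because a password
    may itself contain colons.
    """
    email_h, _, digest = hash_line.strip().partition(":")
    email_p, _, password = cracked_line.strip().partition(":")
    same_account = email_h.lower() == email_p.lower()
    return same_account and md5_hex(password) == digest.lower()


if __name__ == "__main__":
    # The Mailinator rows quoted above; the result depends on the
    # unsalted-MD5 assumption holding for this particular site.
    print(pair_matches(
        "traw@mailinator.com:bb796fbe5b644a2a88e3c75207ca4b54",
        "traw@mailinator.com:janid",
    ))
```

If a check like this fails for a given site, the most likely explanations are a salt or a different hashing scheme rather than bad data.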
I extracted all the files, ran my usual email address extraction tool over it (effectively just a regex that can quickly enumerate through a large number of files), and found a total of 226,883,414 unique addresses. A substantial number, although not even in the top 10 largest breaches already in HIBP.
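As a rough illustration of that kind of extraction (not the actual tool used here), the sketch below walks a directory of text files and collects unique addresses with a deliberately simple regex. The pattern, the `*.txt` filter and the function names are all assumptions made for the example.

```python
import re
from pathlib import Path
from typing import Iterable, Set

# A deliberately loose pattern; real-world address validation is far messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(lines: Iterable[str]) -> Set[str]:
    """Return the unique, lowercased addresses found in an iterable of lines."""
    found: Set[str] = set()
    for line in lines:
        for match in EMAIL_RE.findall(line):
            found.add(match.lower())
    return found


def extract_from_tree(root: str) -> Set[str]:
    """Scan every .txt file under root and return all unique addresses."""
    unique: Set[str] = set()
    for path in Path(root).rglob("*.txt"):
        # errors="replace" keeps the scan going through mis-encoded files.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            unique |= extract_emails(handle)
    return unique
```

Holding hundreds of millions of addresses in an in-memory set would need a lot of RAM, so at this scale something like an external sort and de-duplication step would be the more realistic design.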
But is it legit? I mean can we trust that both the email addresses and passwords from these alleged breaches represent actual accounts on those services? Let's take the example above which allegedly came from chordie.com, a guitar forum. Over to the password reset and drop in the Mailinator address from before:
Apparently, an email has been sent to that address which indicates it does indeed exist on the site:
And sure enough, in that public Mailinator inbox is the password reset email for a user by the name of "trawis":
Consequently, there is a very high likelihood this data is legit. I haven't notified Chordie as they're one of more than 23k sites listed so clearly disclosure in the traditional sense isn't going to work, at least not where I privately contact the company. But each time I checked, the pattern repeated itself; rakesh_pandit@mailinator.com has an account on fullhyderabad.com:
Or over on sandhuniforms.com, pentestaaa@mailinator.com also had an account:
In that example, the data was found in a file called "www.sandhuniforms.com {54.629} [NOHASH].txt" and true to its name, it appears from the forgotten password email that the passwords were never even hashed in the first place. Same again for johnbvcxzy@mailinator.com on acdc-bootlegs.com:
I'm conscious I'm showing actual email addresses and either passwords or reset tokens in the images above, but again, these are very clearly test accounts with no expectation of privacy. I'm showing these for impact; this is a serious set of data that includes actual breaches that are almost certainly unknown by the site operators.
Many of the sites indicated in this collection of data are now defunct. For example, as of the time of writing, flyinghearts.info simply returns "Forbidden". Back in May, it was a service for blokes to meet Czech women according to archive.org. Or take cyberlearningmauritius.org which is returning HTTP 500 today, but in Jan last year was a (self-proclaimed) global leader in digital education.
At least one site in the collection was previously (publicly) known to have been breached and was already in HIBP: "hookers.nl {287.560} [HASH+NOHASH] (Adult)_special_for_XSS.IS.rar" is already there as a sensitive breach. I'm sure there are others too, so inevitably this isn't 100% new data. Let's see if we can put a number on that:
I was curious as to how much of this data had been seen in other breaches before and if there was an obvious trend. For example, is this largely just data from, say, the Collection #1 credential stuffing list I loaded early last year? I took a slice of addresses from the 226M I'd extracted and started running them against HIBP. Here's what I found after checking over 74k addresses:
Only 55% of the addresses in the sample set had been seen before (after loading the complete data set into HIBP, that number rose to 65%). There were a bunch of addresses in the Collection #1 incident and also in the 2,844 breach collection I added in Feb 2018, but clearly based on the red "null" results there were also many new addresses. In other words, there were a substantial number of people who prior to loading this data, would get no hits when searching HIBP but had previously been in a breach.
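The tallying behind a figure like that can be sketched as follows. This assumes the lookups have already been done and collected into a mapping from each address to the list of breach names it appeared in, with `None` or an empty list meaning no hit. The mapping shape, the breach labels and the `summarise` function are hypothetical, chosen only to illustrate the counting.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def summarise(results: Dict[str, Optional[List[str]]]) -> Tuple[float, Counter]:
    """Return (percentage of addresses seen before, per-breach hit counts).

    Addresses with no prior breach are counted under the label "null".
    """
    seen_before = 0
    per_breach: Counter = Counter()
    for breaches in results.values():
        if breaches:
            seen_before += 1
            per_breach.update(breaches)
        else:
            per_breach["null"] += 1
    percentage = 100.0 * seen_before / len(results) if results else 0.0
    return percentage, per_breach


if __name__ == "__main__":
    # Made-up sample data purely to show the shape of the input.
    sample = {
        "a@example.com": ["Collection #1"],
        "b@example.com": None,
        "c@example.com": ["Collection #1", "2,844 Separate Data Breaches"],
        "d@example.com": [],
    }
    pct, counts = summarise(sample)
    print(f"{pct:.0f}% seen before", counts.most_common())
```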
Then there were the passwords. Eyeballing them, they're all the sorts of terrible passwords you'd expect most people to use. Passwords like "Ashtro1969", "Odette1978" and, perhaps unsurprisingly given the file I was looking at, "ilovechordie". Whilst many of the passwords I tested were terrible enough to have previously appeared in other data breaches and flowed through to Pwned Passwords, these three didn't exist there at all. In fact, over 40M of the passwords in this data set didn't exist there at all.
The passwords, however, do also pose a bit of a conundrum when parsing them out of thousands of separate files. Whilst many existed as credential pairs in the "Results.txt" files of the respective archives, others existed in files such as "libertidating.com {1.928} decrypted.txt" (they're almost certainly cracked hashes rather than "decrypted" ciphers) and "promotionalproductsglobalnetwork.ca {2.166} [NOHASH].txt", the latter possibly indicating that passwords were never hashed to begin with. So, thousands of files, different naming formats and whilst mostly consistent in terms of structure, inevitably there are some parsing issues in there. For example, this "password":
3px;"><a href="docs/!INDEX.html"><b>Ãëàâíàÿ</b></a></div><div style="padding-left: 10px; padding-top: 3px; padding-bottom: 3px;"><a href="docs/ondfi5.html" style="">Î êîìïàíèè</a><br/></div><div style="padding-left: 10px; padding-top: 3px; padding-bottom: 3px;"><a href="docs/8qjisp.html" style="">Óñëóãè</a><br/></div><div style="padding-left: 10px; padding-top: 3px; padding-bottom: 3px;"><a href="
This would be an epic password if someone did in fact use it, but it's almost certainly an upstream parsing error. Or take this password:
welcometomykitchen12345678
Yes, I can envisage someone using it on a website (perhaps one related to cooking), but no, I don't believe it would have been used 6,349 times, which is the number of occurrences found within the breach corpus. Interestingly, they were all sourced from "www.vcanbuy.com {134.303} [HASH] (Business and Industry).txt" and as best I can make out, vcanbuy.com is a Thai fashion site. But neither of these data quality issues matters - here's why:
When these passwords flow through into Pwned Passwords, they ultimately exist as hashes to be downloaded or queried using k-anonymity. Nobody is going to use the first password with all the HTML in it so it has no real world impact. Someone might feasibly try to use the second password and a service using HIBP's Pwned Passwords might then reject it due to its prevalence. I'm ok with that because it's not a good password! But what about hash collisions? What if someone else tries to use a password where the SHA-1 hash is equal to the SHA-1 hash of the junk data? It'd return a hit in HIBP which would effectively be a false positive, but whether there's a small amount of junk data in there or not (and it's a very small amount - well under 1%), the same issue prevails. Plus, considering that there are 16^40 (that is, 2^160) possible SHA-1 hashes, you can easily do the maths on how extremely unlikely this is (and the impact is still very low if it does happen).
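For anyone unfamiliar with how a k-anonymity query works, here's a minimal sketch. The client hashes the password with SHA-1, sends only the first 5 hex characters to the public range endpoint (`https://api.pwnedpasswords.com/range/{prefix}`) and then looks for the remaining 35 characters in the `SUFFIX:COUNT` lines that come back. The function names and the User-Agent string are hypothetical, and error handling is left out to keep it short.

```python
import hashlib
import urllib.request
from typing import Dict, Tuple


def split_sha1(password: str) -> Tuple[str, str]:
    """Return (5-char prefix, 35-char suffix) of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def parse_range(body: str) -> Dict[str, int]:
    """Parse a range response of 'SUFFIX:COUNT' lines into a dict."""
    counts: Dict[str, int] = {}
    for line in body.splitlines():
        suffix, _, count = line.strip().partition(":")
        if suffix and count.isdigit():
            counts[suffix.upper()] = int(count)
    return counts


def pwned_count(password: str, range_body: str) -> int:
    """Return how many times the password appears, given its range response."""
    _, suffix = split_sha1(password)
    return parse_range(range_body).get(suffix, 0)


def fetch_range(prefix: str) -> str:
    """Fetch the range response for a prefix; only the prefix leaves the machine."""
    request = urllib.request.Request(
        f"https://api.pwnedpasswords.com/range/{prefix}",
        headers={"User-Agent": "k-anonymity-sketch"},  # illustrative value
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    candidate = "welcometomykitchen12345678"
    prefix, _ = split_sha1(candidate)
    print(pwned_count(candidate, fetch_range(prefix)))
```

The service never sees the full hash, let alone the password, which is why junk entries in the corpus only matter if someone submits a password that hashes to exactly the same 40 hex characters.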
Given the number of individual breaches, the legitimacy of the data plus the vast number of previously unseen email addresses and passwords, I've loaded it all into HIBP. The lot - both emails and passwords (note: these go in as separate archives and never as pairs, read more about Pwned Passwords here). As with other breaches without a single clear origin, this means that people may find themselves pwned and not know which service leaked their data. It also means they may find their password breached and not know which service leaked it. But it also doesn't matter - here's why:
The goal of HIBP has always been to change behaviours, namely to move people away from using the same one or two or three weak passwords all over the place and into a proper password manager like 1Password, creating strong, unique passwords everywhere (full disclosure: I'm on their board of advisors). If you've done that already and then find yourself in the Cit0day data then it's a non-event for two reasons:
- Being in one of the 23k breaches isolates your risk to that breach alone; because you've not reused the password anywhere else, exposure in that one place doesn't put you at risk anywhere else.
- Passwords randomly generated from a password manager are almost certainly not going to be cracked; even when stored weakly (for example, as an unsalted MD5 hash), your ~40 character random string isn't being cracked. If, on the other hand, the site stored it in plain text, see the first point.
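To put a number on why a ~40 character random string isn't being cracked, here's a small sketch that generates one with Python's `secrets` module and estimates its entropy. The 94-character alphabet (ASCII letters, digits and punctuation) is an assumption for the example; password managers differ in which characters they draw from.

```python
import math
import secrets
import string

# Assumed alphabet: 52 letters + 10 digits + 32 punctuation characters = 94.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_password(length: int = 40) -> str:
    """Generate a random password using a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def entropy_bits(length: int, alphabet_size: int) -> float:
    """Entropy in bits of a uniformly random string: length * log2(alphabet size)."""
    return length * math.log2(alphabet_size)


if __name__ == "__main__":
    print(generate_password())
    print(f"{entropy_bits(40, len(ALPHABET)):.0f} bits of entropy")
```

Under those assumptions a 40 character password carries a little over 260 bits of entropy, which is more than the 128 bits an MD5 digest can represent, so even a fast unsalted hash gives an attacker no practical way to brute force it.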
And if you don't already have a password manager? Then you need to get one and promptly change the password on every important account anyway!
But there is a gap that goes beyond the risks associated with exposed passwords alone, and that's the personal impact of other exposed data. If, for example, you filled a bunch of other personal information into Chordie then it would be reasonable to assume that's now in the possession of other parties and you would quite rightly want to know about that. This is where we really need the sites indicated in those two Gists above to come forth and I suggest the following: If you run a site that's on the list, test a sample set of your own subscribers' email addresses on HIBP. If you're worried about submitting someone else's personal info to my service, grab some Mailinator addresses and check those. If they come back with hits against the Cit0day breach then that's a very strong indication of a breach.
In closing, there are now 226M more breached accounts in HIBP and a further 41M passwords (just over 40M new ones from this incident and just under 1M from other incidents since the last release). Just to emphasise why it was important to get this data set into HIBP, the Pwned Passwords k-anonymity API has been hit 815M times in the last month:
Feeding these passwords into the corpus of known breached ones has an immediate and tangible impact on account takeovers, which is good for online services, good for individuals and good for the web as a whole.
A last word on this: please don't contact me and ask for details on the breach your address was in or the password used. I operate this as a free service in my available time and don't have the capacity to reply to even a tiny fraction of the 226M people in this incident. Get a password manager and use strong and unique passwords; that is all.