tl;dr - a collection of nearly 3k alleged data breaches has appeared with a bunch of data already proven legitimate from previous incidents, but also tens of millions of addresses that haven't been seen in HIBP before. Those 80M records are now searchable, read on for the full story:
There's an unknown number of data breaches floating around the web. There are data breaches we knew of but they just took years to appear publicly (Dropbox, LinkedIn), data breaches we didn't know of that also took years to discover at all (Disqus, imgur) and indeed, data breaches that were deliberately covered up (Lifeboat, Uber). But I suspect that another big slice of data breaches are the ones that both the website operators themselves and the general public know nothing about, the "unknown unknowns", as it were. By its very nature, we don't know how big this list is, but "very big" would be a pretty safe bet.
In running Have I Been Pwned (HIBP) these last 4 and a bit years, one of the things that constantly amazes me is the breadth of data breaches individuals often collect. People hoard it, swap it, crack it, sell it and occasionally, just redistribute it all publicly. I regularly see these massive lists of breaches belonging to a personal stash, often numbering in the hundreds of incidents and frequently containing data I've seen circulating before. Today, however, I came across something a bit different by way of a story from last week titled 3,000 Databases with 200 Million Unique accounts found on Dark Web. Now, as I said only a couple of weeks ago, I'm immediately suspicious when people start saying "dark web". That 1.4B list I reference in that post, for example, was almost entirely data I'd seen before and it was being distributed via Reddit, "the front page of the internet". But these things are always worth a look anyway so I set about locating the data.
After a single-digit number of minutes looking for it, someone pointed me to a well-known hacking forum with a post from 4 days before the story mentioned above. Consistent with my aforementioned "debunking the dark web" blog post, the forum in question is located in the "very clear web" and is easily discovered (although I'm not going to make it any easier here). It then links directly through to 8.8GB worth of easily downloadable data breaches, all obtainable in a single ZIP file. In total, there were 2,889 text files in the archive but it's what's inside them which I found particularly interesting.
Almost all the files are just email addresses and plain text passwords (the occasional file has a username that's not an email address and a password). This is interesting in that it's reminiscent of the Exploit.In and Anti Public credential stuffing lists I loaded back in May. However, in those cases they were single lists amalgamated from multiple sources whilst in this case, we're looking at individual website names that appear to have had merely the credentials extracted from the source data breaches. It's also interesting because among nearly 3k other breaches, the data contains Dropbox. I wrote about the Dropbox breach back in August last year and I pointed out the structure of the breached files at the time:
There were 2 files with bcrypt hashes and 2 with SHA-1 hashes. But here's what was particularly interesting:
the bcrypt accounts include the salt whilst the SHA1 accounts don't
In other words, you've got one highly resilient hashing algorithm in bcrypt (work factor of 8) and one fairly weak one in SHA-1, albeit without the salt which would usually be needed to crack it. But there's 18.6M rows of email addresses and plain text passwords in this new file, so where are the passwords from? I grabbed a few email addresses then went back to the original data breach and pulled the corresponding records for them. I then tested them against an online bcrypt hash generator to see if the passwords in the new set of data were the ones used in the original Dropbox breach. Here's a couple that matched:
40330140 : $2a$08$g5lpS.MIENA68r98kcCG0.g0qF3Hu9C97dKC9BvsP4Z8S.4rGn9By
ledzep69 : $2a$08$8hDGT.2ofu8P7G10mWiy/.tvOjubQkMGAppbOMf4xAcgaTfsDC9VG
But they didn't all match, in fact most I tested didn't. They didn't all come from the bcrypt files in the Dropbox data either, a bunch were from the SHA-1 files which had no salt. So what can we conclude from this? Well firstly, Dropbox allowed some pretty atrocious passwords at one time there! And secondly, these passwords have almost certainly not been cracked out of the Dropbox data, otherwise I would have found a lot more matches (I tested some pretty terrible passwords too). But it does contain email addresses from the Dropbox breach (every one I tested was in the original breach) and we know people reuse passwords, so the logical conclusion is that someone has joined email addresses from one source with passwords from another.
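To make the structure of those two matching hashes concrete, here's a minimal Python sketch that splits a bcrypt string into its parts. It assumes the standard bcrypt string layout (version, two-digit cost, 22-character salt, 31-character digest); the `parse_bcrypt` helper and `BcryptParts` type are names made up for this illustration and aren't from any tooling I used. It only shows that the work factor and salt travel inside the hash string itself. Actually checking a candidate password still needs a bcrypt implementation, such as the `checkpw` function in the third-party `bcrypt` package.

```python
import re
from typing import NamedTuple

# Assumed layout of a bcrypt string: $<version>$<cost>$<22-char salt><31-char digest>
BCRYPT_RE = re.compile(
    r"^\$(2[abxy]?)\$(\d{2})\$([./A-Za-z0-9]{22})([./A-Za-z0-9]{31})$"
)


class BcryptParts(NamedTuple):
    """Hypothetical container for the fields of a bcrypt hash string."""
    version: str
    cost: int
    salt: str
    digest: str


def parse_bcrypt(hash_str: str) -> BcryptParts:
    """Split a bcrypt hash string into version, cost, salt and digest."""
    match = BCRYPT_RE.match(hash_str.strip())
    if match is None:
        raise ValueError("not a bcrypt hash string")
    version, cost, salt, digest = match.groups()
    return BcryptParts(version, int(cost), salt, digest)


# One of the two hashes shown above
parts = parse_bcrypt("$2a$08$g5lpS.MIENA68r98kcCG0.g0qF3Hu9C97dKC9BvsP4Z8S.4rGn9By")
print(parts.version, parts.cost, parts.salt)
```

The `08` in the second field is the work factor of 8 mentioned earlier, and the salt sits right alongside it, which is why the bcrypt records can be tested against a guessed password without any extra information.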
Moving on, regardless of how the data inside the files was put together, I wanted to get a sense of how many of them were new versus incidents I'd seen before. Any existing breaches in HIBP that I could identify in this new set were omitted. Some of them were obvious, for example Dropbox and MySpace, so I pulled these (among a handful of others) out. I then grabbed a unique set of addresses from the remaining data and tested a random 10k of them against HIBP. Only 70% of them were already in the system which indicates a lot of new data; 30% of the addresses I'd never seen before. Of course, many of the addresses I had seen before would still have come from data breaches that weren't in HIBP; those addresses had simply been pwned more than once. But the checks against the system also gave me an opportunity to do a bit more source cleanup.
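The sampling step above can be sketched roughly like this in Python. This is an illustration under stated assumptions rather than the process I actually ran: the `:` separator between address and password is assumed (the files may use something else), `unique_addresses` and `sample_hit_rate` are made-up helper names, and `is_known` stands in for whatever lookup tells you an address is already in the system.

```python
import random
from typing import Callable, Iterable, Optional, Set


def unique_addresses(lines: Iterable[str], sep: str = ":") -> Set[str]:
    """Collect de-duplicated, lower-cased addresses from 'address<sep>password' lines.

    The separator is an assumption; lines without an '@' in the first field
    (usernames that are not email addresses) are skipped.
    """
    addresses = set()
    for line in lines:
        address = line.strip().split(sep, 1)[0].strip().lower()
        if "@" in address:
            addresses.add(address)
    return addresses


def sample_hit_rate(
    addresses: Set[str],
    is_known: Callable[[str], bool],
    sample_size: int = 10_000,
    seed: Optional[int] = None,
) -> float:
    """Return the fraction of a random sample that is_known() already recognises."""
    rng = random.Random(seed)
    pool = sorted(addresses)  # random.sample needs a sequence, not a set
    sample = rng.sample(pool, min(sample_size, len(pool)))
    if not sample:
        return 0.0
    hits = sum(1 for address in sample if is_known(address))
    return hits / len(sample)


# Tiny made-up example: two unique addresses, one of them already known
lines = [
    "alice@example.com:hunter2",
    "ALICE@example.com:hunter2",
    "bob@example.com:pass",
    "not-an-email:pass",
]
known = {"alice@example.com"}
rate = sample_hit_rate(unique_addresses(lines), known.__contains__)
print(f"{rate:.0%} of sampled addresses already seen")
```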
In analysing the results of the HIBP checks, further duplication came to light. For example, the largest remaining file after my initial cleanup was "SGB.net.txt" but the domain sgb.net is presently parked and archive.org doesn't show anything of substance on it in the past either. But when checking the data against HIBP, I kept getting hits against the Lifeboat data breach. That site runs on lbsg.net, which is not too dissimilar to the filename in the set I was dealing with here, so I removed that file. The file named "Alpari.com.txt" was full of Chinese addresses and constantly showed hits against the NetEase and Aipai.com breaches. Given that alpari.com is a financial services site located in the Caribbean, something doesn't add up here so I removed that one as well.
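One way to picture that kind of check is to tally, per file, which already-known breach the addresses hit most often. The Python sketch below is only an illustration: I spotted these cases by eyeballing the results, and both the `dominant_breach` helper and the 50% threshold are assumptions invented for this example. `breaches_for` is a stand-in for any lookup that returns the known breach names for an address.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple


def dominant_breach(
    addresses: Iterable[str],
    breaches_for: Callable[[str], Iterable[str]],
    threshold: float = 0.5,  # assumed cut-off, purely illustrative
) -> Optional[Tuple[str, float]]:
    """Return (breach name, share of addresses) if one known breach covers
    at least `threshold` of the addresses in a file, otherwise None."""
    address_list = list(addresses)
    if not address_list:
        return None
    counts: Counter = Counter()
    for address in address_list:
        # set() so an address counts at most once per breach
        counts.update(set(breaches_for(address)))
    if not counts:
        return None
    name, hits = counts.most_common(1)[0]
    share = hits / len(address_list)
    return (name, share) if share >= threshold else None


# Made-up lookup data standing in for real breach check results
lookup = {
    "a@example.com": ["Lifeboat"],
    "b@example.com": ["Lifeboat", "OtherBreach"],
    "c@example.com": [],
}
result = dominant_breach(lookup, lambda address: lookup.get(address, []))
print(result)
```

A file whose addresses overwhelmingly map to a single known incident is a strong hint that it's a relabelled copy of that incident rather than a new breach, though the filename and domain checks described above are still what settled it in each case.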
In total, I distilled the data down to 2,844 files which contained a total of 80,115,532 unique email addresses. Another sample set of these showed closer to 66% of them having been in HIBP already, which is much nearer the normal "hit rate" I see with a brand new, genuine data breach (have a read through the HIBP Twitter feed to see actual numbers). This was now data I was comfortable loading because we're talking tens of millions of people in (alleged) breaches I've never seen before. But I'm also conscious that I can't clearly say "this is the breach you were in" as there's no direct association between the accounts in HIBP and the source file. However, I can list those source files in the hope that it'll help people who might recognise a service they've used in the past. Here's the complete list:
It should be abundantly clear from this post, but let me explicitly state it anyway: I have no idea how many of these are legitimate, how many are partially correct and how many are outright fabricated. I've consequently flagged this "breach" in HIBP as unverified. However, I can confidently say that amongst this set was a large number of records in breaches that I've previously verified and that per the Dropbox example, there are passwords that have been used by the email addresses they're associated with. I'm conscious that people can be left feeling like they don't know what action to now take, but when I've asked in the past people are overwhelmingly in favour of knowing where their data has been exposed. As with almost every other data breach, treat this as a reminder of how important a dedicated password manager is for ensuring all your passwords are unique and genuinely strong. Read the only secure password is the one you can’t remember for more on that.
Finally, if you find your data in this set and recognise the source of it from the list above, do leave a comment below as it may help others identify where their information has been exposed.
Edit: For people asking for the source data or passwords, please read No, I cannot share data breaches with you and Here are all the reasons I don't make passwords available via Have I Been Pwned.