I feel the need, the need for speed.
Faster, Faster, until the thrill of speed overcomes the fear of death.
If you're in control, you're not going fast enough.
And so on and so forth. There's a time and a place for going fast, and there's no better place to do that than when querying Have I Been Pwned's Pwned Passwords service. (Ok, a lot less glamorous than the context of the previous statements, but also less likely to have a catastrophic outcome.)
In December last year, Pwned Passwords saw not just a fresh batch of 225M new passwords from the NCA, but it also welcomed the ongoing ingestion of new passwords from the FBI. This created a lot of excitement, which is great, but it also led to a very important question: what's the fastest way to query the entire corpus of data? That API is returning more than 99% of queries from the Cloudflare edge so it should be super fast, but how fast? What happens if you want to check millions of passwords? As all the Pwned Passwords code is now open source, we thought it would be cool to open this challenge up to the community and see what you can come up with in terms of an app to do just this. And, in the spirit of open source, that code should be available to all so that your good work can benefit the masses.
Here's what we've done: chief Pwned Passwords wrangler Stefán Jökull Sigurðarson has stood up a repository which you can begin working on right now. If you'd like to contribute in another language, leave a comment below and we'll create an all-new repo in your favourite language; folks using this code need to be able to read, understand and trust the code, so the more the merrier. There's no API key for this service so no secrets management, just plain and simple code that queries well-documented existing APIs.
A few little tips for you:
- You should be able to read in a collection of plain text passwords from a text file and write out a set of results where for each password, the prevalence with which it's been seen is emitted in a readily parseable format. For example, a CSV file of plain text password and prevalence pairs.
- There are only 16^5 (1,048,576) different possible API queries due to the way the anonymity model works (check the docs - you're only querying the first 5 chars of a SHA-1 hash). Particularly as the number of passwords checked increases beyond the ~1M possible unique queries, the time per password query should massively decrease...
- ...depending on how much data you can cache locally without the need to re-query a remote service. Think about using local storage or memory constructs such that you never need to go back out "over the wire" for subsequent queries of the same hash prefix.
- If you'd like a sample password set to test on, try the top 100,000 most prevalent passwords in Pwned Passwords, courtesy of the NCSC (that'll obviously have a 100% hit rate).
- If you'd like an even larger sample set, try the 14M "Rockyou" list (that'll also give you a 100% hit rate as all of those were loaded into HIBP very early on).
- Report on speed. Include a summary at the end of the process that provides - at the very least - the number of passwords checked, the percentage found in HIBP, the total duration and the number of passwords checked per second. This will help us race different variants of the challenge 😎
- That's all, the repo is here, go for it: https://github.com/HaveIBeenPwned/PwnedPasswordsSpeedChallenge
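To make those tips a bit more concrete, here's a rough Python sketch of the moving parts: hash each password with SHA-1, query by the first 5 hex chars, cache each prefix in memory so it's only ever fetched once, then write CSV and report on speed. Treat it as an illustration under a few assumptions rather than a reference implementation: the endpoint and the `SUFFIX:COUNT` response lines are as I understand them from the public Pwned Passwords API docs, while the function names, the `User-Agent` value and the file paths are all made up for this example. It also queries sequentially, so a serious entry would want to parallelise the requests.

```python
import csv
import hashlib
import time
from typing import Callable, Dict, Iterable, List, Tuple
from urllib.request import Request, urlopen

# Range endpoint per the public API docs (assumption: unchanged since writing).
RANGE_URL = "https://api.pwnedpasswords.com/range/"


def fetch_range(prefix: str) -> str:
    """Download the response body for one 5-character hash prefix."""
    # The User-Agent value is an arbitrary label chosen for this sketch.
    request = Request(RANGE_URL + prefix, headers={"User-Agent": "speed-challenge-sketch"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


def parse_range(body: str) -> Dict[str, int]:
    """Turn 'SUFFIX:COUNT' lines into a suffix -> count dictionary."""
    counts: Dict[str, int] = {}
    for line in body.splitlines():
        suffix, _, count = line.strip().partition(":")
        if suffix and count:
            counts[suffix.upper()] = int(count)
    return counts


def check_passwords(
    passwords: Iterable[str],
    fetch: Callable[[str], str] = fetch_range,
) -> List[Tuple[str, int]]:
    """Return (password, prevalence) pairs, fetching each prefix at most once."""
    cache: Dict[str, Dict[str, int]] = {}  # prefix -> {suffix: count}
    results: List[Tuple[str, int]] = []
    for password in passwords:
        digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
        prefix, suffix = digest[:5], digest[5:]
        if prefix not in cache:  # only go "over the wire" on a cache miss
            cache[prefix] = parse_range(fetch(prefix))
        results.append((password, cache[prefix].get(suffix, 0)))
    return results


def summarise(results: List[Tuple[str, int]], seconds: float) -> Dict[str, float]:
    """Build the end-of-run report: checked, % found, duration, rate."""
    checked = len(results)
    found = sum(1 for _, count in results if count > 0)
    return {
        "checked": checked,
        "percent_found": 100.0 * found / checked if checked else 0.0,
        "seconds": seconds,
        "per_second": checked / seconds if seconds > 0 else 0.0,
    }


def run(in_path: str, out_path: str, fetch: Callable[[str], str] = fetch_range) -> Dict[str, float]:
    """Read one password per line, write password,prevalence CSV rows, return the summary."""
    with open(in_path, encoding="utf-8") as source:
        passwords = [line.rstrip("\r\n") for line in source if line.rstrip("\r\n")]
    started = time.perf_counter()
    results = check_passwords(passwords, fetch)
    elapsed = time.perf_counter() - started
    with open(out_path, "w", newline="", encoding="utf-8") as target:
        csv.writer(target).writerows(results)
    return summarise(results, elapsed)
```

Something like `print(run("passwords.txt", "results.csv"))` would then check a file and print the summary (both file names are placeholders). Passing `fetch` in as a parameter is just a convenience so the caching logic can be exercised without touching the network, and an in-memory dictionary is the simplest possible cache; persisting prefixes to local storage is where the really big speed-ups across runs would come from.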
And that's pretty much it. This is a pretty simple little challenge that I hope you can have some fun with, but it's also a challenge that will do a great deal of good for many organisations and individuals alike.
Stefán has already made a head start on a C# .NET version and has achieved some blistering results:
Mileage will vary based on factors such as cache level (his Cloudflare edge node already had 100% of those requests cached) and bandwidth (let's just say Reykjavík doesn't have Australia-level connectivity), but those 3k+ requests per second are a pretty good benchmark to begin with. Wanna go really fast? Check out the speed against locally cached passwords:
So, can you beat it in your language of choice? Give it a go 🙂
And just to save this coming up in the comments, no organisation should be storing customer passwords in a format they could readily feed into this challenge. Where this is useful is for cases where passwords have been obtained in plain text and that ranges from credential stuffing lists to malware campaigns to law enforcement agencies identifying compromised passwords in the course of their investigations.