If I’m honest, the success of Have I been pwned? (HIBP) took me by surprise. It started out as an intriguing exercise to look at how the same accounts were being compromised across multiple data breaches and morphed into something well beyond that in pretty short order. The unexpected success of the service made for some really intriguing technology challenges and provided me with an excellent opportunity to push Microsoft’s Azure to the limits, not just in terms of performance, but in how I could engineer the whole thing to cost me just about nothing to run.
For example, within the first week after I launched it, the service got too big for Google Analytics as it was already tracking over 10 million hits a month. I had to optimise quickly for the unexpected success as unforeseen things began happening, things like serving tens of GB of jQuery in a day which is not something I needed to pay for if I used a CDN. As I rolled out new features, I found new challenges. The API is a great example; automation of queries makes for a system that can go from hundreds of requests a minute to tens of thousands in the blink of an eye. How do you support that massive change in scale and not break the bank, particularly when the service is out there for free? As I wrote recently, this challenge led to me optimising the storage to the point where it returns records in single digit milliseconds and only costs $2.50 a month for the Table Storage mechanism that drives it.
One of the things that really surprised me is the amount of media coverage the service got. I’m not into plastering the logos of the various organisations that covered it over the homepage, but I do track the larger ones on a press page. There’s not even a link to this on the site but I wanted a record of it and I reckon it’s a pretty good one; a couple of Time magazine articles, USA Today, multiple pieces from Forbes, various other consumer-centric stories then of course lots of tech coverage like ZDNet, Gizmodo, Ars Technica, PC World etc. There’s a huge amount of foreign language coverage too which was a bit unexpected.
I’ve also used HIBP as an opportunity to write extensively about how I’ve pulled together all the technology bits to make this work as well as it does. If you have a spin through the HIBP tag on my blog, you’ll find a huge amount of info which, by all accounts, has been enormously useful for other people building online services. I’ve been as open as I can about this but there’s one piece that’s been ticking away in the background that I’ve not shared and that’s what I want to write about today because it signals a new chapter for HIBP.
It’s free and it will (almost certainly) scale further than you can
Late last year I had a chat with an organisation about how they could use the data to support their customers if they’re caught up in a data breach. I’ll talk more about the nature of those customers in a moment, but I want to share the early nature of the discussion first as it played out in a familiar way:
Org: We’d like to query your API to check if our customers have been compromised online, how much does it cost?
Me: Well it’s free, just go for it.
Org: No really, we’re going to query a lot of customers, that must cost something?
Me: I’ve tested it with 380k queries a minute! Infrastructure will scale out and magic will happen, so long as you’re not maxing it out for perpetuity, just go for it.
And so they did and I paid some small amount of money to enable them to do it. To that effect, I’ve never once seen sufficient load to cause infrastructure to scale to a point where I’ve actually been worried about the cost. It can scale to that point (I’ve tested it), but it’s a problem I just haven’t had to deal with yet.
But this exercise actually led into a more interesting discussion that’s resulted in me building out the features I’m going to write about here. You see, this particular organisation was left with a conundrum; yes, they’d been able to check all their customers against the freely available API, but what about the future? I mean what happens when one of them gets pwned in a subsequent breach? The org can’t perpetually hit my API with every single customer on a regular cadence. The new feature solves this problem, but let me first tell you about this company and why what they’re doing is such a great use case for HIBP.
MyLife and callbacks
MyLife spends their days helping people understand where they’re exposed on the web. They provide various services which include identity theft protection, something that’s obviously very closely linked to online data breaches given that many of those expose data that attackers then monetise. For an organisation like MyLife, monitoring the web for their customers’ identities is an important part of the service they provide. The better they can do that and the earlier they can be made aware of an event that impacts one of their customers, the more valuable their service is.
For a use case like this, what’s needed is proactive notifications where HIBP can tell them when something happens to one of their customers. I already had an email notification service that individuals can sign up to with their own account plus of course the domain search feature for when an address @[whatever] showed up, but this wasn’t going to cut it. For one, ownership of the email or domain had to be verified which was infeasible for their requirement, but the other problem was that email is by no means an appropriate construct by which to notify an organisation the size of MyLife (more on scale soon). So I built web hooks.
Here’s how it works: there’s an authenticated API by which an organisation using this service can either subscribe or unsubscribe an email or a domain – any email or domain – and there’s an administration portal where they then configure their own service details. What this means is that they’re setting up a service such as https://acme.com/hibpcallback then when one of their subscribed email addresses is seen by HIBP, that service receives a request explaining what the incident was. For example, they may subscribe foo@bar.com and when that email address appears in a paste, moments later the organisation receives a notification of the incident. A breach callback, for instance, arrives as a request with a JSON body like this:
{
  "Email": "foo@bar.com",
  "Breach": {
    "Title": "Pokemon Creed",
    "Name": "PokemonCreed",
    "Domain": "pokemoncreed.net",
    "BreachDate": "2014-8-8",
    "AddedDate": "2014-08-10T00:03:59Z",
    "PwnCount": 116465,
    "Description": "In August 2014, the Pokemon RPG website <a href=\"http://pokemoncreed.net\" target=\"_blank\">Pokemon Creed</a> was hacked after a dispute with rival site, <a href=\"http://pkmndusk.in\" target=\"_blank\">Pokemon Dusk</a>. In a <a href=\"https://www.facebook.com/ramandeep.s.dehal/posts/749666358442465\" target=\"_blank\">post on Facebook</a>, \"Cruz Dusk\" announced the hack then pasted the dumped MySQL database on <a href=\"http://pkmndusk.in\" target=\"_blank\">pkmndusk.in</a>. The breached data included over 116k usernames, email addresses and plain text passwords.",
    "DataClasses": ["Email addresses", "Genders", "IP addresses", "Passwords", "Usernames", "Website activity"]
  }
}

What the recipient of the callback then does with it is entirely up to them; log it into their own system, notify the individual, flag it in some sort of a customer portal – whatever; HIBP is effectively a white-labelled service. Same deal with domains – if they’re monitoring bar.com and then fiz@bar.com and buzz@bar.com appear in a new breach, they’ll get one callback (this way it’s atomic for the domain) listing the impacted email addresses on that domain and of course the details of the incident they were found in.
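As a rough illustration of the receiving side, here is a minimal Python sketch of how a subscriber might decode an email callback shaped like the example payload above. The field names (Email, Breach, Title, Name, BreachDate, PwnCount, DataClasses) come from that example; everything else is an assumption made for illustration, including the BreachNotification class, the parse_callback and passwords_exposed helpers, and the idea of prompting a password reset. It does not cover the domain callback, whose exact shape isn't shown here, and it leaves out the web framework and authentication you'd wrap around it.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class BreachNotification:
    """The fields a subscriber is likely to care about (hypothetical type)."""
    email: str
    breach_name: str
    breach_title: str
    breach_date: str
    pwn_count: int
    data_classes: List[str]


def parse_callback(body: bytes) -> BreachNotification:
    """Decode a callback body that follows the example payload's shape."""
    payload = json.loads(body.decode("utf-8"))
    breach = payload["Breach"]
    return BreachNotification(
        # Normalise the address so it matches however it was stored locally.
        email=payload["Email"].strip().lower(),
        breach_name=breach["Name"],
        breach_title=breach["Title"],
        breach_date=breach["BreachDate"],
        pwn_count=int(breach["PwnCount"]),
        data_classes=list(breach["DataClasses"]),
    )


def passwords_exposed(notification: BreachNotification) -> bool:
    """True when the breach's data classes include passwords."""
    return "Passwords" in notification.data_classes


# A trimmed copy of the example payload (Description shortened for brevity).
SAMPLE_BODY = json.dumps({
    "Email": "foo@bar.com",
    "Breach": {
        "Title": "Pokemon Creed",
        "Name": "PokemonCreed",
        "Domain": "pokemoncreed.net",
        "BreachDate": "2014-8-8",
        "AddedDate": "2014-08-10T00:03:59Z",
        "PwnCount": 116465,
        "Description": "(shortened)",
        "DataClasses": ["Email addresses", "Genders", "IP addresses",
                        "Passwords", "Usernames", "Website activity"],
    },
}).encode("utf-8")

notification = parse_callback(SAMPLE_BODY)
if passwords_exposed(notification):
    # What happens next is up to the subscriber; this is one possible action.
    print(f"{notification.email}: prompt a password reset "
          f"({notification.breach_title})")
```

Keeping the parsing in a plain function like this means the same code can sit behind whatever HTTP endpoint the organisation configures in the portal.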
Here’s how it all looks:
The history of the callbacks can then be reviewed in a dedicated HIBP portal:
This is a beautifully simple, elegant solution that’s been working flawlessly. It’s highly resilient, extremely reliable and has proven enormously successful. Let me give you a sense of just how successful…
Hundreds of thousands of callbacks later…
MyLife loaded over 10M of their customers into HIBP. That’s right – ten million – which at least in my book, is a serious number. What this means is that they’re seeing serious volumes of callbacks. For example, I checked in after the first five and a half days of the service running and saw there’d been over 5,000 callbacks. This was from a combination of 563 pastes that were automatically imported into the service plus one data breach.
When the Adult Friend Finder incident happened in May, MyLife received over 74k callbacks. Put that in context for a moment: that’s 74k opportunities they have to reach back out to customers and let them know that something important has just happened to their account. Seventy-four-thousand! These numbers are significant and are now into the hundreds of thousands which is enormously valuable for a company that depends on being able to monitor their customers’ exposure and communicate with them accordingly when something of significance happens.
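To put those first-week numbers beside the polling alternative, here is a back-of-envelope calculation in Python. The 10M customers, the five and a half days, the 5,000 callbacks and the 380k queries a minute are the figures quoted in this post (the first and third are lower bounds); the once-a-day polling cadence is purely an assumption for the sake of comparison.

```python
# Figures quoted in the post.
CUSTOMERS = 10_000_000            # "over 10M" customers loaded
WINDOW_DAYS = 5.5                 # first five and a half days of the service
CALLBACKS_IN_WINDOW = 5_000       # "over 5,000" callbacks in that window
TESTED_PEAK_PER_MINUTE = 380_000  # load the API was tested at

# Assumption: without callbacks, every customer is re-checked once a day.
POLLS_PER_CUSTOMER_PER_DAY = 1

# Total API requests polling would have needed over the same window.
polling_requests = int(CUSTOMERS * POLLS_PER_CUSTOMER_PER_DAY * WINDOW_DAYS)

# How many polling requests it would take to surface each real incident.
requests_per_hit = polling_requests / CALLBACKS_IN_WINDOW

# The steady request rate that daily polling implies.
sustained_per_minute = CUSTOMERS * POLLS_PER_CUSTOMER_PER_DAY / (24 * 60)

# Time for one full sweep of all customers at the tested peak rate.
full_sweep_minutes = CUSTOMERS / TESTED_PEAK_PER_MINUTE

print(f"Polling requests over {WINDOW_DAYS} days: {polling_requests:,}")
print(f"Polling requests per actual incident:  {requests_per_hit:,.0f}")
print(f"Sustained polling rate per minute:     {sustained_per_minute:,.0f}")
print(f"Minutes per full sweep at tested peak: {full_sweep_minutes:.1f}")
```

Under that assumed cadence, polling means around 55 million requests to find what roughly 5,000 callbacks delivered directly, and the incidents would still only be noticed up to a day late.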
So all this is great but let me get to the crux of it – this is now being offered as a commercial service. Let me explain.
Commercialising enterprise subscriptions
Let me talk a bit about money before I talk about what I’m now doing with the service. Clearly, I’ve created something of value and I don’t just mean this new feature, but the HIBP service in general. There have been many, many suggestions of how to commercialise the service and indeed offers to invest money to build it out into something larger. There are competing services that do charge money for the things I already do for free. There have been multiple companies offering to sponsor or advertise on the site or do any number of things to put money in my pocket for it… and I’ve rejected every single one. The only money I’ve ever made out of HIBP is that which people have willingly given of their own free volition after using it and that’s by way of donations (and thank you kindly to all those who have!)
I don’t want the overtly independent nature of the site diluted by having someone else’s brand on it. I don’t want to stop individuals and organisations from assessing their own exposure by charging for the service that helps them identify malicious activity against them on the web. I also don’t want this to become some sort of VC-driven commercial machine that loses its independence and reason for being. I could have turned around and stood up a payment gateway and started putting prices on things, but I don’t want to have thousands of micro-customers using small parts of the system; it’s a lot of support overhead and again, I want this to remain freely accessible to those who need it most.
What I’ve ultimately decided to do is focus instead on the MyLifes of the world; the larger organisations that want to use this system at scale to make a difference to their customers. I want a small number of organisations that I actually want to have a relationship with; I want to understand how this can help them, get involved in their implementation and then ensure HIBP continues to evolve in a way that supports these needs. This may be the sorts of organisations that play a role in cleaning up after the likes of this crazy government data breach (it’s 22M people now apparently), it could be financial institutions that can make good use of the data for threat intelligence purposes or it could even be telcos looking for a value-add for the customers that they’re putting on the web. In fact it’s all these and I know it’s all these because I’m talking to them already. And I’m charging for it.
I gave it a lot of thought and there really isn’t a way of defining a single cost model that makes sense. Different organisations derive different value from the service, be that through the size of the audience they want to monitor or the nature of the service they’re providing to those individuals. They need different levels of support to help bring the service to fruition for them and they need different levels of my expertise to help make sense of the data. So the cost is a discussion that I’m having individually with each consumer of the service. Clearly, by giving so much away for free for so long, my goal is not to jump in and monetise it to the hilt; I want the right relationships with the right organisations that can do something genuinely meaningful with the service.
Understanding data volumes
Before I wrap up, I want to give a little bit of context to the sort of data volumes HIBP is now dealing with which will help put the value proposition of having access to this in context. The obvious headlines are the public ones on the site:
But let me give you a little more detail:
- In the last three months, there have been 3.7M email addresses retrieved from almost 6k pastes at a rate of more than 40k a day
- In that three months there were also 9 verified data breaches added with 9.7M email addresses
- In a typical week, HIBP is presently serving 4.5M requests across both the API and organic web traffic
- There are now 126k verified subscribers (they’ve confirmed their email address) monitoring pastes and breaches
- It is still very, very fast:
Loving the way @haveibeenpwned is scaling lately - check the response times across near zero to 16k requests per min pic.twitter.com/SpNNoi7szY
— Troy Hunt (@troyhunt) July 17, 2015
Obviously all this goes in ebbs and flows too; sometimes I’ll see over half a million email addresses in pastes in one day, sometimes I’ll load 4M accounts in one go from a single big data breach and sometimes I’ll see going on a million requests in an hour. But across all of this, HIBP continues to perform beautifully and I’m really excited about the potential to now push it much further as people find all new ways of using the feature I’ve described here.
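For a feel of how spiky that is, the figures above can be turned into average and peak rates with a few lines of Python. All of the inputs are numbers quoted in this post; the only assumption is treating "three months" as 91 days and "going on a million requests in an hour" as a round million.

```python
# Figures quoted in the post.
PASTE_EMAILS = 3_700_000          # email addresses from pastes in three months
PASTE_COUNT = 6_000               # "almost 6k" pastes, so an upper bound
WEEKLY_REQUESTS = 4_500_000       # requests in a typical week
PEAK_HOURLY_REQUESTS = 1_000_000  # "going on a million" in a busy hour

# Assumption: three months is roughly 91 days.
DAYS_IN_THREE_MONTHS = 91

# Average inflow of paste data.
paste_emails_per_day = PASTE_EMAILS / DAYS_IN_THREE_MONTHS
emails_per_paste = PASTE_EMAILS / PASTE_COUNT

# Average versus peak request rates, both per minute.
average_per_minute = WEEKLY_REQUESTS / (7 * 24 * 60)
peak_per_minute = PEAK_HOURLY_REQUESTS / 60
peak_to_average = peak_per_minute / average_per_minute

print(f"Paste emails per day (average):  {paste_emails_per_day:,.0f}")
print(f"Emails per paste (at least):     {emails_per_paste:,.0f}")
print(f"Requests per minute (average):   {average_per_minute:,.0f}")
print(f"Requests per minute (busy hour): {peak_per_minute:,.0f}")
print(f"Peak-to-average ratio:           {peak_to_average:.1f}")
```

The daily paste figure lands just over 40k, in line with the rate quoted above, and a million-request hour works out to a little under 17k requests a minute, which is in the same range as the 16k a minute in the tweet.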
So what’s next?
On the surface, nothing changes with HIBP. The same service works the same way and costs the same zero dollars as it always did. The difference now is that behind the scenes I get busier helping organisations make the most of the data I already have and continue to obtain. Now that may well mean more trickles back into the public-facing side of it as well; perhaps more data sources or other features I either haven’t thought of or just haven’t gotten around to building yet. The point is that I’ve every intention of keeping what people know just as it is or making it better, but not charging for it or whacking ads on it or anything like that.
The main thing now is to focus on those new enterprise use cases. I’m already talking with a bunch of organisations of different natures so things are moving forward. If this service makes sense to an organisation you’re involved in then I’d love to chat.
Also, regardless of your commercial interests, if anyone has feedback, suggestions or other commentary, please leave them in the comments below. I’ve brushed over a number of things for the sake of brevity which I suspect will leave a few unanswered questions; let’s take them below if you’re curious.