⚡ Where Algorithmic Transparency Meets Community

How algorithms support contributors at Wikipedia

In elementary and middle school, I would hear this refrain from my teachers constantly: “Wikipedia’s not a reliable source.” Yet today, it’s Wikipedia that populates the information surfaced by many search engines, not just in the main list of results but in topic summaries explicitly highlighted by Google (or DuckDuckGo, Ecosia, or Bing). In other words, Wikipedia is here for good as a critical component of our information ecosystem.

I’m super excited to share today’s interview with Hal Triedman (he/him), a Privacy Engineer working on transparency and machine learning at the Wikimedia Foundation.


🌐 where algorithmic transparency meets community

This interview has been edited for length and clarity.

What do you work on at Wikimedia? 

If you were to abstract everything that I do into a single phrase, it would be algorithmic transparency. That encompasses making model cards and trying to make sure that people are aware of and understand the models that are affecting their experience on the platform, but it also includes things like creating transparency in the sense of releasing more data. We don't collect a lot of data (the privacy policy explicitly states that the Wikimedia Foundation minimizes the amount of data collected), but we're a really big website. Wikipedia gets somewhere on the order of 13 billion pageviews per month, which puts us in the top five most looked at sites in the world. We're the only nonprofit in the top five, and because of that, there’s a unique responsibility to shine some light into how the Internet is actually working.

What kinds of algorithms is Wikipedia using? If I were interacting with the site, where would I be actually interfacing with an algorithm? 

We have 150-200 models spanning 30-40 languages; for the most part, they’re designed explicitly as human-in-the-loop systems. Only in very rare cases will a decision be made solely based on a model prediction without a human sign-off.

One example of how models are used is edit review. Let's say somebody goes to the Wikipedia page for OpenAI and ― anyone can edit Wikipedia ― let’s say they really don't like what OpenAI is doing. They hit the edit button and they type something vulgar. That’s not a good faith edit, and we have a model to check for that: saying this has a 75% chance of being a bad faith edit. This model is about trying to make the life of editors who are trying to sign off on edits a lot easier by prioritizing the edits that are potentially going to be harmful to the community and the quality of information. 

These aren’t super advanced models, to be clear, more like a random forest classifier than GPT-3. 
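
For readers who want to picture the human-in-the-loop triage described above, here is a minimal sketch. It is an illustration under stated assumptions, not Wikimedia's actual code: the `ScoredEdit` class, the `build_review_queue` function, the 0.5 threshold, and the example scores are all hypothetical. The model only orders the queue; a human editor still makes every decision.

```python
from dataclasses import dataclass


@dataclass
class ScoredEdit:
    edit_id: int           # hypothetical identifier for an edit
    page: str              # page the edit was made on
    bad_faith_prob: float  # hypothetical model output in [0, 1]


def build_review_queue(edits, threshold=0.5):
    """Return edits scored at or above the threshold, most suspicious first.

    Nothing is reverted automatically; the result is a to-do list for humans.
    """
    flagged = [e for e in edits if e.bad_faith_prob >= threshold]
    return sorted(flagged, key=lambda e: e.bad_faith_prob, reverse=True)


# Made-up scores, including the 75% example from the interview.
edits = [
    ScoredEdit(1, "OpenAI", 0.75),
    ScoredEdit(2, "Organic chemistry", 0.05),
    ScoredEdit(3, "Military history", 0.92),
]

queue = build_review_queue(edits)
print([e.edit_id for e in queue])  # prints [3, 1]
```

Sorting by score rather than acting on it is the design choice that matters here: a reviewer sees the riskiest edits first, and low-scoring edits simply wait longer.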

Other examples of where we use algorithms: classifying the quality of articles ranging from stubs to fully fleshed out articles which we call Featured Articles; those are the articles we could feature on the home page because they're incredibly high quality. There’s also topic classification ― sports, biochemistry, philosophy. Across languages the models are often quite similar in architecture since the tasks are quite similar, and because we’re looking at features of individual edits rather than features of big chunks of text.

From an organizational perspective, how and where did this push for transparency start?

We're not a very large organization, probably something like 500 employees; of those 500, maybe 300 people are technologists. So it's a small staff working on a large ship. A lot of these questions about transparency have been in the air from people at the very top ― director level, VP level, all the way down to individual contributors for years now; these conversations were definitely happening before I got there in April 2021. It’s a pretty institutional concern. Wikipedia is in an interesting position as an incredibly large presence on the web; and people who engage with us, whether they’re donors or editors or contributors, generally do so with the assumption that the wiki process will eventually converge on some semblance of “good information,” of neutrality. Whether that’s a fair assumption to make is probably outside of my purview ― but there’s a lot at stake, so we want to show everyone our cards, do everything open source, and in theory, make it possible for everyone to understand or for interested outsiders to check our work to make sure that we're doing things the right way.

Can you talk a little more about the scale of Wikipedia (and its contributors), and how that affects Wikimedia policies? 

Let’s zoom out for a second. There are hundreds of millions, maybe billions, of individual people who are seeing a Wikipedia page on a regular basis; within that haystack, there are probably a couple hundred thousand people who are editors, who are really interested in their specific subject area (military history or organic chemistry or scientific papers), and they keep their little corner of Wikipedia up to date as best they can. Within that group of editors, there is also a smaller group, let’s call them super editors, who are more oriented towards the organization; that’s probably a couple thousand people. So we have an outspoken, vibrant, intense community of a couple thousand people who are poring over everything that we [Wikimedia Foundation] do and giving their input on our decisions; there are also researchers who are looking at how our decisions are affecting, for example, gender parity among new article creators or article content. So there’s a built-in urge to transparency as a function of the fact that we as an organization rely so heavily on this external community.

Still, there are sometimes also cases where that runs into privacy issues ― we don't want to accidentally release user data, or break anonymity, or, more concretely, cause a government to take adverse action against an editor. So we do have some conflicting urges, between the urge for radical transparency and the urge to make sure that all of our users are safe. 

Wikipedia thrives off a robust, decentralized community of contributors and decisionmakers, and yet from a technical perspective is a collection of static pages with links to one another. Do you have any thoughts about what Wikipedia might tell us about Web3, and how the ecosystems there might develop? 

I'm by no means a Web3 expert, and Wikipedia is not perfect, but I think that, out of all the technology that came out of the early 2000s, Wikipedia has stood the test of time as a pretty unequivocally good force, especially as we now exist in this world of fractured epistemologies, and untruths, and the mistrust that pervades the social fabric in America but also all over the world. Wikipedia’s not going to solve every problem but it’s a pretty good thing that came from that era.

The interesting thing about Wikimedia [as a technical organization] is, from 2001 until around 2014, 2015, it was run by very few people. They hacked together some code and got some servers up, and held it together for more than a decade with shoestring and duct tape. What we see on Wikipedia today is really about the community and the mission much more than the technology. 

A takeaway that I've increasingly had, as someone who's neither a techno-optimist nor a techno-pessimist but more of a techno-realist, is that the technology is not going to be the thing that makes it work. It’s less about the technical features of a platform (immutability, provenance, et cetera) and more about the community, the norms, the people. In other words, the squishy stuff around the edges.

What have you been reading recently?

Lots of history and fiction — right now I’m reading Devil in the White City by Erik Larson, Empress Dowager Cixi by Jung Chang, and Midnight’s Children by Salman Rushdie. I just finished The Sympathizer by Viet Thanh Nguyen.

Find more of Hal on Twitter, on Goodreads, or at his website.





💝 closing note

Last week, we published a proposal for a corporate wealth tax in A Tax on Tech-quity. Reader Chris Beiser comments: “One adverse consequence of taxing equity is that it advantages fast-growing companies like startups, which have only existed for a short time, at a cost to businesses that grow slower, in more prosaic parts of the economy. This would push more capital towards ‘those who own the technology companies building these advances,’ the opposite of what this proposal claims to seek.” His comment is worth reading in full; check it out below.


What are your Wikipedia core memories? Did you also get yelled at by your teachers? Are you a contributor or editor for Wikipedia, and what’s your experience been like? Reply to this email or comment below — we’d love to hear from you.

See you on the Wikipedia edit logs,

Reboot team
