-
Reranking Social Media Feeds: A Practical Guide for Field Experiments
Authors:
Tiziano Piccardi,
Martin Saveski,
Chenyan Jia,
Jeffrey Hancock,
Jeanne L. Tsai,
Michael S. Bernstein
Abstract:
Social media plays a central role in shaping public opinion and behavior, yet performing experiments on these platforms and, in particular, on feed algorithms is becoming increasingly challenging. This article offers practical recommendations to researchers developing and deploying field experiments focused on real-time re-ranking of social media feeds. This article is organized around two contributions. First, we overview an experimental method using web browser extensions that intercepts and re-ranks content in real-time, enabling naturalistic re-ranking field experiments. We then describe feed interventions and measurements that this paradigm enables on participants' actual feeds, without requiring the involvement of social media platforms. Second, we offer concrete technical recommendations for intercepting and re-ranking social media feeds with minimal user-facing delay, and provide an open-source implementation. This document aims to summarize lessons learned, provide concrete implementation details, and foster the ecosystem of independent social media research.
Submitted 27 June, 2024;
originally announced June 2024.
-
Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM
Authors:
Michelle S. Lam,
Janice Teoh,
James Landay,
Jeffrey Heer,
Michael S. Bernstein
Abstract:
Data analysts have long sought to turn unstructured text data into meaningful concepts. Though common, topic modeling and clustering focus on lower-level keywords and require significant interpretative work. We introduce concept induction, a computational process that instead produces high-level concepts, defined by explicit inclusion criteria, from unstructured text. For a dataset of toxic online comments, where a state-of-the-art BERTopic model outputs "women, power, female," concept induction produces high-level concepts such as "Criticism of traditional gender roles" and "Dismissal of women's concerns." We present LLooM, a concept induction algorithm that leverages large language models to iteratively synthesize sampled text and propose human-interpretable concepts of increasing generality. We then instantiate LLooM in a mixed-initiative text analysis tool, enabling analysts to shift their attention from interpreting topics to engaging in theory-driven analysis. Through technical evaluations and four analysis scenarios ranging from literature review to content moderation, we find that LLooM's concepts improve upon the prior art of topic models in terms of quality and data coverage. In expert case studies, LLooM helped researchers to uncover new insights even from familiar datasets, for example by suggesting a previously unnoticed concept of attacks on out-party stances in a political social media dataset.
Submitted 18 April, 2024;
originally announced April 2024.
-
Social Skill Training with Large Language Models
Authors:
Diyi Yang,
Caleb Ziems,
William Held,
Omar Shaikh,
Michael S. Bernstein,
John Mitchell
Abstract:
People rely on social skills like conflict resolution to communicate effectively and to thrive in both work and personal life. However, practice environments for social skills are typically out of reach for most people. How can we make social skill training more available, accessible, and inviting? Drawing upon interdisciplinary research from communication and psychology, this perspective paper identifies social skill barriers to enter specialized fields. Then we present a solution that leverages large language models for social skill training via a generic framework. Our AI Partner, AI Mentor framework merges experiential learning with realistic practice and tailored feedback. This work ultimately calls for cross-disciplinary innovation to address the broader implications for workforce development and social equality.
Submitted 5 April, 2024;
originally announced April 2024.
-
Form-From: A Design Space of Social Media Systems
Authors:
Amy X. Zhang,
Michael S. Bernstein,
David R. Karger,
Mark S. Ackerman
Abstract:
Social media systems are as varied as they are pervasive. They have been almost universally adopted for a broad range of purposes including work, entertainment, activism, and decision making. As a result, they have also diversified, with many distinct designs differing in content type, organization, delivery mechanism, access control, and many other dimensions. In this work, we aim to characterize and then distill a concise design space of social media systems that can help us understand similarities and differences, recognize potential consequences of design choices, and identify spaces for innovation. Our model, which we call Form-From, characterizes social media based on (1) the form of the content, either threaded or flat, and (2) from where or from whom one might receive content, ranging from spaces to networks to the commons. We derive Form-From inductively from a larger set of 62 dimensions organized into 10 categories. To demonstrate the utility of our model, we trace the history of social media systems as they traverse the Form-From space over time, and we identify common design patterns within cells of the model.
Submitted 23 March, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
Clarify: Improving Model Robustness With Natural Language Corrections
Authors:
Yoonho Lee,
Michelle S. Lam,
Helena Vasconcelos,
Michael S. Bernstein,
Chelsea Finn
Abstract:
In supervised learning, models are trained to extract correlations from a static dataset. This often leads to models that rely on high-level misconceptions. To prevent such misconceptions, we must necessarily provide additional information beyond the training data. Existing methods incorporate forms of additional instance-level supervision, such as labels for spurious features or additional labeled data from a balanced distribution. Such strategies can become prohibitively costly for large-scale datasets since they require additional annotation at a scale close to the original training data. We hypothesize that targeted natural language feedback about a model's misconceptions is a more efficient form of additional supervision. We introduce Clarify, a novel interface and method for interactively correcting model misconceptions. Through Clarify, users need only provide a short text description to describe a model's consistent failure patterns. Then, in an entirely automated way, we use such descriptions to improve the training process by reweighting the training data or gathering additional targeted data. Our user studies show that non-expert users can successfully describe model misconceptions via Clarify, improving worst-group accuracy by an average of 17.1% in two datasets. Additionally, we use Clarify to find and rectify 31 novel hard subpopulations in the ImageNet dataset, improving minority-split accuracy from 21.1% to 28.7%.
Submitted 6 February, 2024;
originally announced February 2024.
-
Rehearsal: Simulating Conflict to Teach Conflict Resolution
Authors:
Omar Shaikh,
Valentino Chai,
Michele J. Gelfand,
Diyi Yang,
Michael S. Bernstein
Abstract:
Interpersonal conflict is an uncomfortable but unavoidable fact of life. Navigating conflict successfully is a skill -- one that can be learned through deliberate practice -- but few have access to effective training or feedback. To expand this access, we introduce Rehearsal, a system that allows users to rehearse conflicts with a believable simulated interlocutor, explore counterfactual "what if?" scenarios to identify alternative conversational paths, and learn through feedback on how and when to apply specific conflict strategies. Users can utilize Rehearsal to practice handling a variety of predefined conflict scenarios, from office disputes to relationship issues, or they can choose to create their own setting. To enable Rehearsal, we develop IRP prompting, a method of conditioning output of a large language model on the influential Interest-Rights-Power (IRP) theory from conflict resolution. Rehearsal uses IRP to generate utterances grounded in conflict resolution theory, guiding users towards counterfactual conflict resolution strategies that help de-escalate difficult conversations. In a between-subjects evaluation, 40 participants engaged in an actual conflict with a confederate after training. Compared to a control group with lecture material covering the same IRP theory, participants with simulated training from Rehearsal significantly improved their performance in the unaided conflict: they reduced their use of escalating competitive strategies by an average of 67%, while doubling their use of cooperative strategies. Overall, Rehearsal highlights the potential effectiveness of language models as tools for learning and practicing interpersonal skills.
Submitted 29 February, 2024; v1 submitted 21 September, 2023;
originally announced September 2023.
-
Cura: Curation at Social Media Scale
Authors:
Wanrong He,
Mitchell L. Gordon,
Lindsay Popowski,
Michael S. Bernstein
Abstract:
How can online communities execute a focused vision for their space? Curation offers one approach, where community leaders manually select content to share with the community. Curation enables leaders to shape a space that matches their taste, norms, and values, but the practice is often intractable at social media scale: curators cannot realistically sift through hundreds or thousands of submissions daily. In this paper, we contribute algorithmic and interface foundations enabling curation at scale, and manifest these foundations in a system called Cura. Our approach draws on the observation that, while curators' attention is limited, other community members' upvotes are plentiful and informative of curators' likely opinions. We thus contribute a transformer-based curation model that predicts whether each curator will upvote a post based on previous community upvotes. Cura applies this curation model to create a feed of content that it predicts the curator would want in the community. Evaluations demonstrate that the curation model accurately estimates opinions of diverse curators, that changing curators for a community results in clearly recognizable shifts in the community's content, and that, consequently, curation can reduce anti-social behavior by half without extra moderation effort. By sampling different types of curators, Cura lowers the threshold to genres of curated social media ranging from editorial groups to stakeholder roundtables to democracies.
Submitted 26 August, 2023;
originally announced August 2023.
-
Embedding Democratic Values into Social Media AIs via Societal Objective Functions
Authors:
Chenyan Jia,
Michelle S. Lam,
Minh Chau Mai,
Jeff Hancock,
Michael S. Bernstein
Abstract:
Can we design artificial intelligence (AI) systems that rank our social media feeds to consider democratic values such as mitigating partisan animosity as part of their objective functions? We introduce a method for translating established, vetted social scientific constructs into AI objective functions, which we term societal objective functions, and demonstrate the method with application to the political science construct of anti-democratic attitudes. Traditionally, we have lacked observable outcomes to use to train such models; however, the social sciences have developed survey instruments and qualitative codebooks for these constructs, and their precision facilitates translation into detailed prompts for large language models. We apply this method to create a democratic attitude model that estimates the extent to which a social media post promotes anti-democratic attitudes, and test this democratic attitude model across three studies. In Study 1, we first test the attitudinal and behavioral effectiveness of the intervention among US partisans (N=1,380) by manually annotating (alpha=.895) social media posts with anti-democratic attitude scores and testing several feed ranking conditions based on these scores. Removal (d=.20) and downranking feeds (d=.25) reduced participants' partisan animosity without compromising their experience and engagement. In Study 2, we scale up the manual labels by creating the democratic attitude model, finding strong agreement with manual labels (rho=.75). Finally, in Study 3, we replicate Study 1 using the democratic attitude model instead of manual labels to test its attitudinal and behavioral impact (N=558), and again find that the feed downranking using the societal objective function reduced partisan animosity (d=.25). This method presents a novel strategy to draw on social science theory and methods to mitigate societal harms in social media AIs.
Submitted 14 February, 2024; v1 submitted 25 July, 2023;
originally announced July 2023.
-
Characterizing Image Accessibility on Wikipedia across Languages
Authors:
Elisa Kreiss,
Krishna Srinivasan,
Tiziano Piccardi,
Jesus Adolfo Hermosillo,
Cynthia Bennett,
Michael S. Bernstein,
Meredith Ringel Morris,
Christopher Potts
Abstract:
We make a first attempt to characterize image accessibility on Wikipedia across languages, present new experimental results that can inform efforts to assess description quality, and offer some strategies to improve Wikipedia's image accessibility.
Submitted 15 May, 2023;
originally announced May 2023.
-
Generative Agents: Interactive Simulacra of Human Behavior
Authors:
Joon Sung Park,
Joseph C. O'Brien,
Carrie J. Cai,
Meredith Ringel Morris,
Percy Liang,
Michael S. Bernstein
Abstract:
Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents--computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty-five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors: for example, starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture--observation, planning, and reflection--each contribute critically to the believability of agent behavior. By fusing large language models with computational, interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.
Submitted 5 August, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
Model Sketching: Centering Concepts in Early-Stage Machine Learning Model Design
Authors:
Michelle S. Lam,
Zixian Ma,
Anne Li,
Izequiel Freitas,
Dakuo Wang,
James A. Landay,
Michael S. Bernstein
Abstract:
Machine learning practitioners often end up tunneling on low-level technical details like model architectures and performance metrics. Could early model development instead focus on high-level questions of which factors a model ought to pay attention to? Inspired by the practice of sketching in design, which distills ideas to their minimal representation, we introduce model sketching: a technical framework for iteratively and rapidly authoring functional approximations of a machine learning model's decision-making logic. Model sketching refocuses practitioner attention on composing high-level, human-understandable concepts that the model is expected to reason over (e.g., profanity, racism, or sarcasm in a content moderation task) using zero-shot concept instantiation. In an evaluation with 17 ML practitioners, model sketching reframed thinking from implementation to higher-level exploration, prompted iteration on a broader range of model designs, and helped identify gaps in the problem formulation--all in a fraction of the time ordinarily required to build a model.
Submitted 5 March, 2023;
originally announced March 2023.
-
Breaking Out of the Ivory Tower: A Large-scale Analysis of Patent Citations to HCI Research
Authors:
Hancheng Cao,
Yujie Lu,
Yuting Deng,
Daniel A. McFarland,
Michael S. Bernstein
Abstract:
What is the impact of human-computer interaction research on industry? While it is impossible to track all research impact pathways, the growing literature on translational research impact measurement offers patent citations as one measure of how industry recognizes and draws on research in its inventions. In this paper, we perform a large-scale measurement study primarily of 70,000 patent citations to premier HCI research venues, tracing how HCI research is cited in United States patents over the last 30 years. We observe that 20.1% of papers from these venues, including 60--80% of papers at UIST and 13% of papers in a broader dataset of SIGCHI-sponsored venues overall, are cited by patents -- far greater than premier venues in science overall (9.7%) and NLP (11%). However, the time lag between a patent and its paper citations is long (10.5 years) and getting longer, suggesting that HCI research and practice may not be efficiently connected.
Submitted 31 January, 2023;
originally announced January 2023.
-
Measuring the Prevalence of Anti-Social Behavior in Online Communities
Authors:
Joon Sung Park,
Joseph Seering,
Michael S. Bernstein
Abstract:
With increasing attention to online anti-social behaviors such as personal attacks and bigotry, it is critical to have an accurate accounting of how widespread anti-social behaviors are. In this paper, we empirically measure the prevalence of anti-social behavior in one of the world's most popular online community platforms. We operationalize this goal as measuring the proportion of unmoderated comments in the 97 most popular communities on Reddit that violate eight widely accepted platform norms. To achieve this goal, we contribute a human-AI pipeline for identifying these violations and a bootstrap sampling method to quantify measurement uncertainty. We find that 6.25% (95% Confidence Interval [5.36%, 7.13%]) of all comments in 2016, and 4.28% (95% CI [2.50%, 6.26%]) in 2020-2021, are violations of these norms. Most anti-social behaviors remain unmoderated: moderators only removed one in twenty violating comments in 2016, and one in ten violating comments in 2020. Personal attacks were the most prevalent category of norm violation; pornography and bigotry were the most likely to be moderated, while politically inflammatory comments and misogyny/vulgarity were the least likely to be moderated. This paper offers a method and set of empirical results for tracking these phenomena as both the social practices (e.g., moderation) and technical practices (e.g., design) evolve.
Submitted 27 August, 2022;
originally announced August 2022.
-
Social Simulacra: Creating Populated Prototypes for Social Computing Systems
Authors:
Joon Sung Park,
Lindsay Popowski,
Carrie J. Cai,
Meredith Ringel Morris,
Percy Liang,
Michael S. Bernstein
Abstract:
Social computing prototypes probe the social behaviors that may arise in an envisioned system design. This prototyping practice is currently limited to recruiting small groups of people. Unfortunately, many challenges do not arise until a system is populated at a larger scale. Can a designer understand how a social system might behave when populated, and make adjustments to the design before the system falls prey to such challenges? We introduce social simulacra, a prototyping technique that generates a breadth of realistic social interactions that may emerge when a social computing system is populated. Social simulacra take as input the designer's description of a community's design -- goal, rules, and member personas -- and produce as output an instance of that design with simulated behavior, including posts, replies, and anti-social behaviors. We demonstrate that social simulacra shift the behaviors that they generate appropriately in response to design changes, and that they enable exploration of "what if?" scenarios where community members or moderators intervene. To power social simulacra, we contribute techniques for prompting a large language model to generate thousands of distinct community members and their social interactions with each other; these techniques are enabled by the observation that large language models' training data already includes a wide variety of positive and negative behavior on social media platforms. In evaluations, we show that participants are often unable to distinguish social simulacra from actual community behavior and that social computing designers successfully refine their social computing designs when using social simulacra.
Submitted 8 August, 2022;
originally announced August 2022.
-
Balancing Producer Fairness and Efficiency via Prior-Weighted Rating System Design
Authors:
Thomas Ma,
Michael S. Bernstein,
Ramesh Johari,
Nikhil Garg
Abstract:
Online marketplaces use rating systems to promote the discovery of high-quality products. However, these systems also lead to high variance in producers' economic outcomes: a new producer who sells high-quality items may unluckily receive one low rating early on, negatively impacting their future popularity. We investigate the design of rating systems that balance the goals of identifying high-quality products (efficiency) and minimizing the variance in economic outcomes of producers of similar quality (individual producer fairness).
We show that there is a trade-off between these two goals: rating systems that promote efficiency are necessarily less individually fair to producers. We introduce prior-weighted rating systems as an approach to managing this trade-off. Informally, the system we propose sets a system-wide prior for the quality of an incoming product; subsequently, the system updates that prior to a posterior for each producer's quality based on user-generated ratings over time. We show theoretically that in markets where products accrue reviews at an equal rate, the strength of the rating system's prior determines the operating point on the identified trade-off: the stronger the prior, the more the marketplace discounts early ratings data (increasing individual fairness), but the slower the platform is in learning about true item quality (so efficiency suffers). We further analyze this trade-off in a responsive market where customers make decisions based on historical ratings. Through calibrated simulations, we show that the choice of prior strength mediates the same efficiency-consistency trade-off in this setting. Overall, we demonstrate that by tuning the prior as a design choice in a prior-weighted rating system, platforms can be intentional about the balance between efficiency and producer fairness.
Submitted 25 November, 2023; v1 submitted 9 July, 2022;
originally announced July 2022.
-
A Web-Scale Analysis of the Community Origins of Image Memes
Authors:
Durim Morina,
Michael S. Bernstein
Abstract:
Where do the most popular online cultural artifacts such as image memes originate? Media narratives suggest that cultural innovations often originate in peripheral communities and then diffuse to the mainstream core; behavioral science suggests that intermediate network positions that bridge between the periphery and the core are especially likely to originate many influential cultural innovations. Research has yet to fully adjudicate between these predictions because prior work focuses on individual platforms such as Twitter; however, any single platform is only a small, incomplete part of the larger online cultural ecosystem. In this paper, we perform the first analysis of the origins and diffusion of image memes at web scale, via a one-month crawl of all indexible online communities that principally share meme images with English text overlays. Our results suggest that communities at the core of the network originate the most highly diffused image memes: the top 10% of communities by network centrality originate the memes that generate 62% of the image meme diffusion events on the web. A zero-inflated negative binomial regression confirms that memes from core communities are more likely to diffuse than those from peripheral communities even when controlling for community size and activity level. However, a replication analysis that follows the traditional approach of testing the same question only within a single large community, Reddit, finds the regression coefficients reversed -- underscoring the importance of engaging in web-scale, cross-community analyses. The ecosystem-level viewpoint of this work positions the web as a highly centralized generator of cultural artifacts such as image memes.
Submitted 11 April, 2022;
originally announced April 2022.
-
Comparing the Perceived Legitimacy of Content Moderation Processes: Contractors, Algorithms, Expert Panels, and Digital Juries
Authors:
Christina A. Pan,
Sahil Yakhmi,
Tara P. Iyer,
Evan Strasnick,
Amy X. Zhang,
Michael S. Bernstein
Abstract:
While research continues to investigate and improve the accuracy, fairness, and normative appropriateness of content moderation processes on large social media platforms, even the best process cannot be effective if users reject its authority as illegitimate. We present a survey experiment comparing the perceived institutional legitimacy of four popular content moderation processes. We conducted a within-subjects experiment in which we showed US Facebook users moderation decisions and randomized the description of whether those decisions were made by paid contractors, algorithms, expert panels, or juries of users. Prior work suggests that juries will have the highest perceived legitimacy due to the benefits of judicial independence and democratic representation. However, expert panels had greater perceived legitimacy than algorithms or juries. Moreover, outcome alignment - agreement with the decision - played a larger role than process in determining perceived legitimacy. These results suggest benefits to incorporating expert oversight in content moderation and underscore that any process will face legitimacy challenges derived from disagreement about outcomes.
Submitted 6 October, 2022; v1 submitted 13 February, 2022;
originally announced February 2022.
-
Jury Learning: Integrating Dissenting Voices into Machine Learning Models
Authors:
Mitchell L. Gordon,
Michelle S. Lam,
Joon Sung Park,
Kayur Patel,
Jeffrey T. Hancock,
Tatsunori Hashimoto,
Michael S. Bernstein
Abstract:
Whose labels should a machine learning (ML) algorithm learn to emulate? For ML tasks ranging from online comment toxicity to misinformation detection to medical diagnosis, different groups in society may have irreconcilable disagreements about ground truth labels. Supervised ML today resolves these label disagreements implicitly using majority vote, which overrides minority groups' labels. We introduce jury learning, a supervised ML approach that resolves these disagreements explicitly through the metaphor of a jury: defining which people or groups, in what proportion, determine the classifier's prediction. For example, a jury learning model for online toxicity might centrally feature women and Black jurors, who are commonly targets of online harassment. To enable jury learning, we contribute a deep learning architecture that models every annotator in a dataset, samples from annotators' models to populate the jury, then runs inference to classify. Our architecture enables juries that dynamically adapt their composition, explore counterfactuals, and visualize dissent.
Submitted 7 February, 2022;
originally announced February 2022.
-
A "Distance Matters" Paradox: Facilitating Intra-Team Collaboration Can Harm Inter-Team Collaboration
Authors:
Xinlan Emily Hu,
Rebecca Hinds,
Melissa A. Valentine,
Michael S. Bernstein
Abstract:
By identifying the socio-technical conditions required for teams to work effectively remotely, the Distance Matters framework has been influential in CSCW since its introduction in 2000. Advances in collaboration technology and practices have since brought teams increasingly closer to achieving these conditions. This paper presents a ten-month ethnography in a remote organization, where we observed that despite exhibiting excellent remote collaboration, teams paradoxically struggled to collaborate across team boundaries. We extend the Distance Matters framework to account for inter-team collaboration, arguing that challenges analogous to those in the original intra-team framework -- common ground, collaboration readiness, collaboration technology readiness, and coupling of work -- persist but are actualized differently at the inter-team scale. Finally, we identify a fundamental tension between the intra- and inter-team layers: the collaboration technology and practices that help individual teams thrive (e.g., adopting customized collaboration software) can also prompt collaboration challenges in the inter-team layer, and conversely the technology and practices that facilitate inter-team collaboration (e.g., strong centralized IT organizations) can harm practices at the intra-team layer. The addition of the inter-team layer to the Distance Matters framework opens new opportunities for CSCW, where balancing the tension between team and organizational collaboration needs will be a critical technological, operational, and organizational challenge for remote work in the coming decades.
Submitted 4 February, 2022;
originally announced February 2022.
-
Crowdsourcing County-Level Data on Early COVID-19 Policy Interventions in the United States: Technical Report
Authors:
Jacob Ritchie,
Mark Whiting,
Sorathan Chaturapruek,
J. D. Zamfirescu-Pereira,
Madhav Marathe,
Achla Marathe,
Stephen Eubank,
Michael S. Bernstein
Abstract:
Beginning in April 2020, we gathered partial county-level data on non-pharmaceutical interventions (NPIs) implemented in response to the COVID-19 pandemic in the United States, using both volunteer and paid crowdsourcing. In this report, we document the data collection process and summarize our results, to increase the utility of our open data and inform the design of future rapid crowdsourcing data collection efforts.
Submitted 15 December, 2021;
originally announced December 2021.
-
On the Opportunities and Risks of Foundation Models
Authors:
Rishi Bommasani,
Drew A. Hudson,
Ehsan Adeli,
Russ Altman,
Simran Arora,
Sydney von Arx,
Michael S. Bernstein,
Jeannette Bohg,
Antoine Bosselut,
Emma Brunskill,
Erik Brynjolfsson,
Shyamal Buch,
Dallas Card,
Rodrigo Castellon,
Niladri Chatterji,
Annie Chen,
Kathleen Creel,
Jared Quincy Davis,
Dora Demszky,
Chris Donahue,
Moussa Doumbouya,
Esin Durmus,
Stefano Ermon,
John Etchemendy,
Kawin Ethayarajh
, et al. (89 additional authors not shown)
Abstract:
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.
Submitted 12 July, 2022; v1 submitted 16 August, 2021;
originally announced August 2021.
-
ESR: Ethics and Society Review of Artificial Intelligence Research
Authors:
Michael S. Bernstein,
Margaret Levi,
David Magnus,
Betsy Rajala,
Debra Satz,
Charla Waeiss
Abstract:
Artificial intelligence (AI) research is routinely criticized for its real and potential impacts on society, and we lack adequate institutional responses to this criticism and to the responsibility that it reflects. AI research often falls outside the purview of existing feedback mechanisms such as the Institutional Review Board (IRB), which are designed to evaluate harms to human subjects rather than harms to human society. In response, we have developed the Ethics and Society Review board (ESR), a feedback panel that works with researchers to mitigate negative ethical and societal aspects of AI research. The ESR's main insight is to serve as a requirement for funding: researchers cannot receive grant funding from a major AI funding program at our university until the researchers complete the ESR process for the proposal. In this article, we describe the ESR as we have designed and run it over its first year across 41 proposals. We analyze aggregate ESR feedback on these proposals, finding that the panel most commonly identifies issues of harms to minority groups, inclusion of diverse stakeholders in the research plan, dual use, and representation in data. Surveys and interviews of researchers who interacted with the ESR found that 58% felt that it had influenced the design of their research project, 100% are willing to continue submitting future projects to the ESR, and that they sought additional scaffolding for reasoning through ethics and society issues.
Submitted 9 July, 2021; v1 submitted 21 June, 2021;
originally announced June 2021.
-
Understanding the Representation and Representativeness of Age in AI Data Sets
Authors:
Joon Sung Park,
Michael S. Bernstein,
Robin N. Brewer,
Ece Kamar,
Meredith Ringel Morris
Abstract:
A diverse representation of different demographic groups in AI training data sets is important in ensuring that the models will work for a large range of users. To this end, recent efforts in AI fairness and inclusion have advocated for creating AI data sets that are well-balanced across race, gender, socioeconomic status, and disability status. In this paper, we contribute to this line of work by focusing on the representation of age by asking whether older adults are represented proportionally to the population at large in AI data sets. We examine publicly-available information about 92 face data sets to understand how they codify age as a case study to investigate how the subjects' ages are recorded and whether older generations are represented. We find that older adults are very under-represented; five data sets in the study that explicitly documented the closed age intervals of their subjects included older adults (defined as older than 65 years), while only one included oldest-old adults (defined as older than 85 years). Additionally, we find that only 24 of the data sets include any age-related information in their documentation or metadata, and that there is no consistent method followed across these data sets to collect and record the subjects' ages. We recognize the unique difficulties in creating representative data sets in terms of age, but raise it as an important dimension that researchers and engineers interested in inclusive AI should consider.
Submitted 6 May, 2021; v1 submitted 10 March, 2021;
originally announced March 2021.
-
Not Now, Ask Later: Users Weaken Their Behavior Change Regimen Over Time, But Expect To Re-Strengthen It Imminently
Authors:
Geza Kovacs,
Zhengxuan Wu,
Michael S. Bernstein
Abstract:
How effectively do we adhere to nudges and interventions that help us control our online browsing habits? If we have a temporary lapse and disable the behavior change system, do we later resume our adherence, or has the dam broken? In this paper, we investigate these questions through log analyses of 8,000+ users on HabitLab, a behavior change platform that helps users reduce their time online. We find that, while users typically begin with high-challenge interventions, over time they allow themselves to slip into easier and easier interventions. Despite this, many still expect to return to the harder interventions imminently: they repeatedly choose to be asked to change difficulty again on the next visit, declining to have the system save their preference for easy interventions.
Submitted 27 January, 2021;
originally announced January 2021.
-
My Team Will Go On: Differentiating High and Low Viability Teams through Team Interaction
Authors:
Hancheng Cao,
Vivian Yang,
Victor Chen,
Yu Jin Lee,
Lydia Stone,
N'godjigui Junior Diarrassouba,
Mark E. Whiting,
Michael S. Bernstein
Abstract:
Understanding team viability -- a team's capacity for sustained and future success -- is essential for building effective teams. In this study, we aggregate features drawn from the organizational behavior literature to train a viability classification model over a dataset of 669 10-minute text conversations of online teams. We train classifiers to identify teams at the top decile (most viable teams), 50th percentile (above a median split), and bottom decile (least viable teams), then characterize the attributes of teams at each of these viability levels. We find that a lasso regression model achieves an accuracy of .74--.92 AUC ROC under different thresholds of classifying viability scores. From these models, we identify the use of exclusive language such as 'but' and 'except', and the use of second person pronouns, as the most predictive features for detecting the most viable teams, suggesting that active engagement with others' ideas is a crucial signal of a viable team. Only a small fraction of the 10-minute discussion, as little as 70 seconds, is required for predicting the viability of team interaction. This work suggests opportunities for teams to assess, track, and visualize their own viability in real time as they collaborate.
Submitted 3 November, 2020; v1 submitted 14 October, 2020;
originally announced October 2020.
-
PolicyKit: Building Governance in Online Communities
Authors:
Amy X. Zhang,
Grant Hugh,
Michael S. Bernstein
Abstract:
The software behind online community platforms encodes a governance model that represents a strikingly narrow set of governance possibilities focused on moderators and administrators. When online communities desire other forms of government, such as ones that take many members' opinions into account or that distribute power in non-trivial ways, communities must resort to laborious manual effort. In this paper, we present PolicyKit, a software infrastructure that empowers online community members to concisely author a wide range of governance procedures and automatically carry out those procedures on their home platforms. We draw on political science theory to encode community governance into policies, or short imperative functions that specify a procedure for determining whether a user-initiated action can execute. Actions that can be governed by policies encompass everyday activities such as posting or moderating a message, but actions can also encompass changes to the policies themselves, enabling the evolution of governance over time. We demonstrate the expressivity of PolicyKit through implementations of governance models such as a random jury deliberation, a multi-stage caucus, a reputation system, and a promotion procedure inspired by Wikipedia's Request for Adminship (RfA) process.
Submitted 17 August, 2020; v1 submitted 10 August, 2020;
originally announced August 2020.
-
Establishing an Evaluation Metric to Quantify Climate Change Image Realism
Authors:
Sharon Zhou,
Alexandra Luccioni,
Gautier Cosne,
Michael S. Bernstein,
Yoshua Bengio
Abstract:
With success on controlled tasks, generative models are being increasingly applied to humanitarian applications [1,2]. In this paper, we focus on the evaluation of a conditional generative model that illustrates the consequences of climate change-induced flooding to encourage public interest and awareness on the issue. Because metrics for comparing the realism of different modes in a conditional generative model do not exist, we propose several automated and human-based methods for evaluation. To do this, we adapt several existing metrics, and assess the automated metrics against gold standard human evaluation. We find that using Fréchet Inception Distance (FID) with embeddings from an intermediary Inception-V3 layer that precedes the auxiliary classifier produces results most correlated with human realism. While insufficient alone to establish a human-correlated automatic evaluation metric, we believe this work begins to bridge the gap between human and automated generative evaluation procedures.
Submitted 22 October, 2019;
originally announced October 2019.
-
Boomerang: Rebounding the Consequences of Reputation Feedback on Crowdsourcing Platforms
Authors:
Snehalkumar S. Gaikwad,
Durim Morina,
Adam Ginzberg,
Catherine Mullings,
Shirish Goyal,
Dilrukshi Gamage,
Christopher Diemert,
Mathias Burton,
Sharon Zhou,
Mark Whiting,
Karolina Ziulkoski,
Alipta Ballav,
Aaron Gilbee,
Senadhipathige S. Niranga,
Vibhor Sehgal,
Jasmine Lin,
Leonardy Kristianto,
Angela Richmond-Fuller,
Jeff Regino,
Nalin Chhibber,
Dinesh Majeti,
Sachin Sharma,
Kamila Mananova,
Dinesh Dhakal
, et al. (13 additional authors not shown)
Abstract:
Paid crowdsourcing platforms suffer from low-quality work and unfair rejections, but paradoxically, most workers and requesters have high reputation scores. These inflated scores, which make high-quality work and workers difficult to find, stem from social pressure to avoid giving negative feedback. We introduce Boomerang, a reputation system for crowdsourcing that elicits more accurate feedback by rebounding the consequences of feedback directly back onto the person who gave it. With Boomerang, requesters find that their highly-rated workers gain earliest access to their future tasks, and workers find tasks from their highly-rated requesters at the top of their task feed. Field experiments verify that Boomerang causes both workers and requesters to provide feedback that is more closely aligned with their private opinions. Inspired by a game-theoretic notion of incentive-compatibility, Boomerang opens opportunities for interaction design to incentivize honest reporting over strategic dishonesty.
Submitted 14 April, 2019;
originally announced April 2019.
-
HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models
Authors:
Sharon Zhou,
Mitchell L. Gordon,
Ranjay Krishna,
Austin Narcomey,
Li Fei-Fei,
Michael S. Bernstein
Abstract:
Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy indirect proxies, because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE), a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model's outputs appear real (e.g., 250 ms), and the other a less expensive variant that measures human error rate on fake and real images sans time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track model improvements across training epochs, and we confirm via bootstrap sampling that HYPE rankings are consistent and replicable.
Submitted 31 October, 2019; v1 submitted 1 April, 2019;
originally announced April 2019.
-
Mechanical Novel: Crowdsourcing Complex Work through Reflection and Revision
Authors:
Joy Kim,
Sarah Sterman,
Allegra Argent Beal Cohen,
Michael S. Bernstein
Abstract:
Crowdsourcing systems accomplish large tasks with scale and speed by breaking work down into independent parts. However, many types of complex creative work, such as fiction writing, have remained out of reach for crowds because work is tightly interdependent: changing one part of a story may trigger changes to the overall plot and vice versa. Taking inspiration from how expert authors write, we propose a technique for achieving interdependent complex goals with crowds. With this technique, the crowd loops between reflection, to select a high-level goal, and revision, to decompose that goal into low-level, actionable tasks. We embody this approach in Mechanical Novel, a system that crowdsources short fiction stories on Amazon Mechanical Turk. In a field experiment, Mechanical Novel resulted in higher-quality stories than an iterative crowdsourcing workflow. Our findings suggest that orienting crowd work around high-level goals may enable workers to coordinate their effort to accomplish complex work.
Submitted 8 November, 2016;
originally announced November 2016.
-
Mosaic: Designing Online Creative Communities for Sharing Works-in-Progress
Authors:
Joy Kim,
Maneesh Agrawala,
Michael S. Bernstein
Abstract:
Online creative communities allow creators to share their work with a large audience, maximizing opportunities to showcase their work and connect with fans and peers. However, sharing in-progress work can be technically and socially challenging in environments designed for sharing completed pieces. We propose an online creative community where sharing process, rather than showcasing outcomes, is the main method of sharing creative work. Based on this, we present Mosaic, an online community where illustrators share work-in-progress snapshots showing how an artwork was completed from start to finish. In an online deployment and observational study, artists used Mosaic as a vehicle for reflecting on how they can improve their own creative process, developed a social norm of detailed feedback, and became less apprehensive of sharing early versions of artwork. Through Mosaic, we argue that communities oriented around sharing creative process can create a collaborative environment that is beneficial for creative growth.
Submitted 8 November, 2016;
originally announced November 2016.
-
Crowd Guilds: Worker-led Reputation and Feedback on Crowdsourcing Platforms
Authors:
Mark E. Whiting,
Dilrukshi Gamage,
Snehalkumar S. Gaikwad,
Aaron Gilbee,
Shirish Goyal,
Alipta Ballav,
Dinesh Majeti,
Nalin Chhibber,
Angela Richmond-Fuller,
Freddie Vargus,
Tejas Seshadri Sarma,
Varshine Chandrakanthan,
Teogenes Moura,
Mohamed Hashim Salih,
Gabriel Bayomi Tinoco Kalejaiye,
Adam Ginzberg,
Catherine A. Mullings,
Yoni Dayan,
Kristy Milland,
Henrique Orefice,
Jeff Regino,
Sayna Parsi,
Kunz Mainali,
Vibhor Sehgal,
Sekandar Matin
, et al. (3 additional authors not shown)
Abstract:
Crowd workers are distributed and decentralized. While decentralization is designed to utilize independent judgment to promote high-quality results, it paradoxically undercuts behaviors and institutions that are critical to high-quality work. Reputation is one central example: crowdsourcing systems depend on reputation scores from decentralized workers and requesters, but these scores are notoriously inflated and uninformative. In this paper, we draw inspiration from historical worker guilds (e.g., in the silk trade) to design and implement crowd guilds: centralized groups of crowd workers who collectively certify each other's quality through double-blind peer assessment. A two-week field experiment compared crowd guilds to a traditional decentralized crowd work model. Crowd guilds produced reputation signals more strongly correlated with ground-truth worker quality than signals available on current crowd working platforms, and more accurate than in the traditional model.
Submitted 28 February, 2017; v1 submitted 4 November, 2016;
originally announced November 2016.
-
A Glimpse Far into the Future: Understanding Long-term Crowd Worker Quality
Authors:
Kenji Hata,
Ranjay Krishna,
Li Fei-Fei,
Michael S. Bernstein
Abstract:
Microtask crowdsourcing is increasingly critical to the creation of extremely large datasets. As a result, crowd workers spend weeks or months repeating the exact same tasks, making it necessary to understand their behavior over these long periods of time. We utilize three large, longitudinal datasets of nine million annotations collected from Amazon Mechanical Turk to examine claims that workers fatigue or satisfice over these long periods, producing lower quality work. We find that, contrary to these claims, workers are extremely stable in their quality over the entire period. To understand whether workers set their quality based on the task's requirements for acceptance, we then perform an experiment where we vary the required quality for a large crowdsourcing task. Workers did not adjust their quality based on the acceptance threshold: workers who were above the threshold continued working at their usual quality level, and workers below the threshold self-selected themselves out of the task. Capitalizing on this consistency, we demonstrate that it is possible to predict workers' long-term quality using just a glimpse of their quality on the first five tasks.
Submitted 1 November, 2016; v1 submitted 15 September, 2016;
originally announced September 2016.
-
Shirtless and Dangerous: Quantifying Linguistic Signals of Gender Bias in an Online Fiction Writing Community
Authors:
Ethan Fast,
Tina Vachovsky,
Michael S. Bernstein
Abstract:
Imagine a princess asleep in a castle, waiting for her prince to slay the dragon and rescue her. Tales like the famous Sleeping Beauty clearly divide up gender roles. But what about more modern stories, born of a generation increasingly aware of social constructs like sexism and racism? Do these stories tend to reinforce gender stereotypes, or counter them? In this paper, we present a technique that combines natural language processing with a crowdsourced lexicon of stereotypes to capture gender biases in fiction. We apply this technique across 1.8 billion words of fiction from the Wattpad online writing community, investigating gender representation in stories, how male and female characters behave and are described, and how authors' use of gender stereotypes is associated with the community's ratings. We find that male over-representation and traditional gender stereotypes (e.g., dominant men and submissive women) are common throughout nearly every genre in our corpus. However, only some of these stereotypes, like sexual or violent men, are associated with highly rated stories. Finally, despite women often being the target of negative stereotypes, female authors are equally likely to write such stereotypes as men.
Submitted 29 March, 2016;
originally announced March 2016.
-
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Authors:
Ranjay Krishna,
Yuke Zhu,
Oliver Groth,
Justin Johnson,
Kenji Hata,
Joshua Kravitz,
Stephanie Chen,
Yannis Kalantidis,
Li-Jia Li,
David A. Shamma,
Michael S. Bernstein,
Fei-Fei Li
Abstract:
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that "the person is riding a horse-drawn carriage".
In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 100K images where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question-answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answers.
Submitted 23 February, 2016;
originally announced February 2016.
-
Atelier: Repurposing Expert Crowdsourcing Tasks as Micro-internships
Authors:
Ryo Suzuki,
Niloufar Salehi,
Michelle S. Lam,
Juan C. Marroquin,
Michael S. Bernstein
Abstract:
Expert crowdsourcing marketplaces have untapped potential to empower workers' career and skill development. Currently, many workers cannot afford to invest the time and sacrifice the earnings required to learn a new skill, and a lack of experience makes it difficult to get job offers even if they do. In this paper, we seek to lower the threshold to skill development by repurposing existing tasks on the marketplace as mentored, paid, real-world work experiences, which we refer to as micro-internships. We instantiate this idea in Atelier, a micro-internship platform that connects crowd interns with crowd mentors. Atelier guides mentor-intern pairs to break down expert crowdsourcing tasks into milestones, review intermediate output, and problem-solve together. We conducted a field experiment comparing Atelier's mentorship model to a non-mentored alternative on a real-world programming crowdsourcing task, finding that Atelier helped interns maintain forward progress and absorb best practices.
Submitted 21 February, 2016;
originally announced February 2016.
-
Embracing Error to Enable Rapid Crowdsourcing
Authors:
Ranjay Krishna,
Kenji Hata,
Stephanie Chen,
Joshua Kravitz,
David A. Shamma,
Li Fei-Fei,
Michael S. Bernstein
Abstract:
Microtask crowdsourcing has enabled dataset advances in social science and machine learning, but existing crowdsourcing schemes are too expensive to scale up with the expanding volume of data. To scale and widen the applicability of crowdsourcing, we present a technique that produces extremely rapid judgments for binary and categorical labels. Rather than punishing all errors, which causes workers to proceed slowly and deliberately, our technique speeds up workers' judgments to the point where errors are acceptable and even expected. We demonstrate that it is possible to rectify these errors by randomizing task order and modeling response latency. We evaluate our technique on a breadth of common labeling tasks such as image verification, word similarity, sentiment analysis and topic classification. Where prior work typically achieves a 0.25x to 1x speedup over fixed majority vote, our approach often achieves an order of magnitude (10x) speedup.
Submitted 14 February, 2016;
originally announced February 2016.
-
SentenceRacer: A Game with a Purpose for Image Sentence Annotation
Authors:
Kenji Hata,
Sherman Leung,
Ranjay Krishna,
Michael S. Bernstein,
Li Fei-Fei
Abstract:
Recently, datasets that contain sentence descriptions of images have enabled models that can automatically generate image captions. However, collecting these datasets is still very expensive. Here, we present SentenceRacer, an online game that gathers and verifies descriptions of images at no cost. Similar to the game hangman, players compete to uncover words in a sentence that ultimately describes an image. SentenceRacer both generates sentences and verifies that they are accurate descriptions. We show that SentenceRacer generates annotations of higher quality than those generated on Amazon Mechanical Turk (AMT).
Submitted 27 August, 2015;
originally announced August 2015.
-
Designing and Deploying Online Field Experiments
Authors:
Eytan Bakshy,
Dean Eckles,
Michael S. Bernstein
Abstract:
Online experiments are widely used to compare specific design alternatives, but they can also be used to produce generalizable knowledge and inform strategic decision making. Doing so often requires sophisticated experimental designs, iterative refinement, and careful logging and analysis. Few tools exist that support these needs. We thus introduce a language for online field experiments called PlanOut. PlanOut separates experimental design from application code, allowing the experimenter to concisely describe experimental designs, whether common "A/B tests" and factorial designs, or more complex designs involving conditional logic or multiple experimental units. These latter designs are often useful for understanding causal mechanisms involved in user behaviors. We demonstrate how experiments from the literature can be implemented in PlanOut, and describe two large field experiments conducted on Facebook with PlanOut. For common scenarios in which experiments are run iteratively and in parallel, we introduce a namespaced management system that encourages sound experimental practice.
Submitted 10 September, 2014;
originally announced September 2014.
-
Analytic Methods for Optimizing Realtime Crowdsourcing
Authors:
Michael S. Bernstein,
David R. Karger,
Robert C. Miller,
Joel Brandt
Abstract:
Realtime crowdsourcing research has demonstrated that it is possible to recruit paid crowds within seconds by managing a small, fast-reacting worker pool. Realtime crowds enable crowd-powered systems that respond at interactive speeds: for example, cameras, robots and instant opinion polls. So far, these techniques have mainly been proof-of-concept prototypes: research has not yet attempted to understand how they might work at large scale or optimize their cost/performance trade-offs. In this paper, we use queueing theory to analyze the retainer model for realtime crowdsourcing, in particular its expected wait time and cost to requesters. We provide an algorithm that allows requesters to minimize their cost subject to performance requirements. We then propose and analyze three techniques to improve performance: push notifications, shared retainer pools, and precruitment, which involves recalling retainer workers before a task actually arrives. An experimental validation finds that precruited workers begin a task 500 milliseconds after it is posted, delivering results below the one-second cognitive threshold for an end-user to stay in flow.
Submitted 13 April, 2012;
originally announced April 2012.