[Resource Topic] 2024/759: Enhancing Watermarked Language Models to Identify Users

Welcome to the resource topic for 2024/759

Enhancing Watermarked Language Models to Identify Users

Authors: Aloni Cohen, Alexander Hoover, Gabe Schoenbach


A zero-bit watermarked language model produces text that is indistinguishable from that of the underlying model, but which can be detected as machine-generated using a secret key. Unfortunately, merely detecting AI-generated spam, say, as watermarked may not prevent future abuses. If we could additionally trace the text to a spammer’s API token or account, we could then cut off their access or pursue legal action.

We introduce multi-user watermarks, which allow tracing model-generated text to individual users or to groups of colluding users. We construct multi-user watermarking schemes from undetectable zero-bit watermarking schemes. Importantly, our schemes provide both zero-bit and multi-user assurances at the same time: detecting shorter snippets just as well as the original scheme, and tracing longer excerpts to individuals. Along the way, we give a generic construction of a watermarking scheme that embeds long messages into generated text.

Ours are the first black-box reductions between watermarking schemes for language models. A major challenge for black-box reductions is the lack of a unified abstraction for robustness — that marked text is detectable even after edits. Existing works give incomparable robustness guarantees, based on bespoke requirements on the language model’s outputs and the users’ edits. We introduce a new abstraction to overcome this challenge, called AEB-robustness. AEB-robustness provides that the watermark is detectable whenever the edited text “approximates enough blocks” of model-generated output. Specifying the robustness condition amounts to defining approximates, enough, and blocks. Using our new abstraction, we relate the robustness properties of our message-embedding and multi-user schemes to that of the underlying zero-bit scheme, in a black-box way. Whereas prior works only guarantee robustness for a single text generated in response to a single prompt, our schemes are robust against adaptive prompting, a stronger and more natural adversarial model.

ePrint: https://eprint.iacr.org/2024/759

See all topics related to this paper.

Feel free to post resources that are related to this paper below.

Example resources include: implementations, explanation materials, talks, slides, links to previous discussions on other websites.

For more information, see the rules for Resource Topics .