TY - GEN
T1 - Bypassing LLM Watermarks with Color-Aware Substitutions
AU - Wu, Qilong
AU - Chandrasekaran, Varun
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
AB - Watermarking approaches have been proposed to identify whether circulated text is human-generated or produced by a large language model (LLM). The state-of-the-art watermarking strategy of Kirchenbauer et al. (2023a) biases the LLM toward generating specific (“green”) tokens. However, the robustness of this watermarking method under finite (low) edit budgets is an open problem, and existing attack methods fail to evade detection for longer text segments. We overcome these limitations and propose Self Color Testing-based Substitution (SCTS), the first “color-aware” attack. SCTS obtains color information by strategically prompting the watermarked LLM and comparing output token frequencies; it uses this information to determine token colors and substitutes green tokens with non-green ones. In our experiments, SCTS successfully evades watermark detection using fewer edits than related work. Additionally, we show both theoretically and empirically that SCTS can remove the watermark from arbitrarily long watermarked text.
UR - http://www.scopus.com/inward/record.url?scp=85204426249&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85204426249&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.acl-long.464
DO - 10.18653/v1/2024.acl-long.464
M3 - Conference contribution
AN - SCOPUS:85204426249
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 8549
EP - 8581
BT - Long Papers
A2 - Ku, Lun-Wei
A2 - Martins, Andre F. T.
A2 - Srikumar, Vivek
PB - Association for Computational Linguistics (ACL)
T2 - 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Y2 - 11 August 2024 through 16 August 2024
ER -
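
Background note on the watermark referenced in the abstract: the Kirchenbauer et al. (2023a) scheme pseudorandomly splits the vocabulary at each generation step into a “green” list (whose logits receive a bias) and a “red” list, and detection counts green tokens with a one-proportion z-test. The Python sketch below is illustrative only and is not the authors' code or the SCTS implementation; the toy hash-based partition seeded on the previous token, the gamma value, the function names, the toy vocabulary, and the detection threshold are all assumptions made for this example.

import hashlib
import math


def green_list(prev_token, vocab, gamma=0.5):
    """Toy stand-in for the watermark's pseudorandom vocabulary split:
    roughly a gamma fraction of tokens is 'green', seeded on the previous token."""
    greens = set()
    for tok in vocab:
        h = hashlib.sha256((prev_token + "|" + tok).encode()).digest()
        if h[0] / 256.0 < gamma:
            greens.add(tok)
    return greens


def detection_z_score(tokens, vocab, gamma=0.5):
    """One-proportion z-test used for detection: compare the observed number of
    green tokens with the gamma*T expected under unwatermarked (null) text."""
    T = len(tokens) - 1  # scored positions; each position is seeded by its predecessor
    green_hits = sum(
        1
        for i in range(1, len(tokens))
        if tokens[i] in green_list(tokens[i - 1], vocab, gamma)
    )
    return (green_hits - gamma * T) / math.sqrt(T * gamma * (1 - gamma))


# Toy usage: a detector flags text whose z-score exceeds some threshold (e.g. 4).
# A color-aware attack in the spirit of SCTS would infer which tokens are green
# (in the paper, by prompting the watermarked LLM and comparing output frequencies)
# and replace them with non-green substitutes, driving the score below the bar.
vocab = ["the", "a", "cat", "dog", "sat", "ran", "on", "under", "mat", "rug"]
text = ["the", "cat", "sat", "on", "the", "mat"]
print(detection_z_score(text, vocab))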