Data owners (distributors) often share machine learning (ML) datasets with third-party collaborators (agents) for various purposes. While such collaborations can be mutually beneficial, they also introduce the risk of data leakage, i.e., the deliberate or accidental disclosure of sensitive ML datasets to unauthorized parties. Consequently, distributors may lose their intellectual property, experience reduced revenue, or violate data privacy regulations. In this paper, we propose DataDetective, a novel black-box dataset watermarking approach that not only detects the unauthorized use of protected datasets but also identifies the agent responsible for the leakage. DataDetective leverages a backdoor technique and consists of two processes. In the dataset watermarking process, a unique watermark signature is embedded into each agent's version of the dataset, causing any model trained on that data to exhibit detectable, agent-specific behaviors. In the leaker identification process, the watermark signature embedded in a suspected model is extracted and compared against the signatures of all agents to identify the leaking agent. Extensive evaluations on benchmark datasets in the computer vision domain demonstrate our method's effectiveness: DataDetective achieved a perfect leaker identification rate with just 1% of the data watermarked, while maintaining model performance with a negligible impact on accuracy. By providing a verifiable and robust solution for leaker attribution, DataDetective enhances accountability in collaborative ML environments. The code is available at https://github.com/NoaWegerhoff/data-detective.
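To make the two processes concrete, the following is a minimal, hypothetical NumPy sketch of the general backdoor-watermarking idea the abstract describes; the trigger-patch construction, the `watermark_dataset` and `identify_leaker` helpers, and the decision threshold are illustrative assumptions, not DataDetective's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def watermark_dataset(images, labels, trigger, target_class, rate=0.01, rng=rng):
    """Embed an agent-specific backdoor trigger into a `rate` fraction of images.

    images: (N, H, W, C) float array; trigger: (h, w, C) patch pasted into
    the image corner. Poisoned samples are relabeled to `target_class`, so any
    model trained on the data learns the agent-specific trigger -> class mapping.
    """
    images, labels = images.copy(), labels.copy()
    n_poison = max(1, int(rate * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    h, w = trigger.shape[:2]
    images[idx, :h, :w, :] = trigger   # paste the agent's trigger patch
    labels[idx] = target_class         # relabel to the target class
    return images, labels

def identify_leaker(model_predict, probe_images, signatures, threshold=0.9):
    """Flag the agent whose signature the suspect model responds to.

    signatures: {agent_id: (trigger, target_class)};
    model_predict: black-box callable mapping a batch of images to labels.
    The agent whose trigger yields the highest attack success rate (ASR),
    above `threshold`, is returned as the suspected leaker.
    """
    best_agent, best_asr = None, 0.0
    for agent, (trigger, target_class) in signatures.items():
        probed = probe_images.copy()
        h, w = trigger.shape[:2]
        probed[:, :h, :w, :] = trigger
        asr = float(np.mean(model_predict(probed) == target_class))
        if asr > best_asr:
            best_agent, best_asr = agent, asr
    return best_agent if best_asr >= threshold else None
```

In this sketch, each agent receives a dataset poisoned with its own trigger; at identification time, the suspect model is queried (black-box) with clean images stamped with every agent's trigger, and the signature that reliably flips predictions to its target class attributes the leak.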