Transformer-based language models demonstrate exceptional performance on Natural Language Processing (NLP) tasks but remain susceptible to backdoor attacks involving hidden input triggers. Trojan injection via hardware bit-flips poses a significant threat to contemporary language models. However, previous research overlooks practical hardware considerations, such as DRAM and cache memory structures, resulting in unrealistic attacks that demand the manipulation of an excessive number of parameters and bits. In this paper, we present TrojBits, a novel approach requiring minimal bit-flips to effectively insert Trojans into real-world Transformer language model systems. This is achieved through a three-module framework designed to efficiently target Transformer-based language models, consisting of Vulnerable Parameters Ranking (VPR), Hardware-aware Attack Optimization (HAO), and Vulnerable Bits Pruning (VBP). Within the VPR module, we are the first to employ gradient-guided Fisher information to identify the most susceptible Transformer parameters, specifically in the word embedding layer. The HAO module then redistributes these parameters across multiple triggers, conforming to hardware constraints by incorporating a regularization term into the Trojan optimization methodology. Finally, the VBP module reduces the number of bit-flips by discarding less significant bits. We evaluate TrojBits on two representative NLP models, BERT and XLNet, on three classification tasks (SST2, OffensEval, and AG's News). Our results demonstrate that TrojBits successfully achieves the inference-time attack by modifying only 64 out of 116 million parameters with 90 bit-flips while maintaining model performance.
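To make the VPR step concrete, the sketch below shows one plausible way to rank word-embedding parameters by empirical Fisher information (accumulated squared gradients) and select the top-k most sensitive entries. This is a minimal illustration assuming a Hugging Face-style PyTorch model; the function name, data format, and exact scoring rule are assumptions for exposition, not the paper's implementation.

```python
import torch

def rank_vulnerable_parameters(model, loss_fn, data_loader, top_k=64):
    """Illustrative sketch: score word-embedding entries by empirical Fisher
    information (squared gradients summed over data) and return the indices
    of the top_k most sensitive entries. Hypothetical helper, not TrojBits code."""
    embedding = model.get_input_embeddings().weight  # word embedding layer
    fisher = torch.zeros_like(embedding)

    model.eval()
    for inputs, labels in data_loader:
        model.zero_grad()
        loss = loss_fn(model(**inputs).logits, labels)
        loss.backward()
        # Empirical Fisher: accumulate squared gradients per parameter
        fisher += embedding.grad.detach() ** 2

    # Pick the k highest-scoring entries and map flat indices back to (row, col)
    top_idx = torch.topk(fisher.flatten(), top_k).indices
    rows = top_idx // embedding.shape[1]
    cols = top_idx % embedding.shape[1]
    return list(zip(rows.tolist(), cols.tolist()))
```

In this sketch, the returned (row, column) pairs would identify the embedding entries that the HAO and VBP stages could subsequently optimize and prune.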