[Resource Topic] 2025/991: MOAI: Module-Optimizing Architecture for Non-Interactive Secure Transformer Inference

Welcome to the resource topic for 2025/991

Title:
MOAI: Module-Optimizing Architecture for Non-Interactive Secure Transformer Inference

Authors: Linru Zhang, Xiangning Wang, Jun Jie Sim, Zhicong Huang, Jiahao Zhong, Huaxiong Wang, Pu Duan, Kwok Yan Lam

Abstract:

The advent of Large Language Models (LLMs) has brought about a new wave of productivity, revolutionizing business operations while keeping costs relatively low. The human-like interface of LLMs enables them to be easily integrated with business functions, thereby freeing up precious human resources for more complex, valuable tasks. However, due to the intensive computation and memory requirements of LLM inference, it is preferable and cheaper to deploy LLMs with Cloud Service Providers (CSPs) that offer high-performance computing resources and low-latency networking. Nevertheless, privacy concerns have been raised about the possibility of data leakage to the CSP. In this work, we seek to address such privacy concerns through the use of Fully Homomorphic Encryption (FHE). FHE enables the CSP to compute directly on data in its encrypted form, thus ensuring that the data stays private and secure. We propose an implementation of LLM inference with FHE. While a line of prior work has demonstrated that it is possible to execute LLM inference in a private manner, designing a practical solution remains a challenge.
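
As a quick illustration of the encrypt-compute-decrypt flow the abstract describes, here is a minimal CKKS sketch using the open-source TenSEAL library. TenSEAL is not the library used in the paper; it is chosen here only because it makes the workflow easy to see, and the parameters are arbitrary toy values.

```python
import tenseal as ts

# Toy CKKS context (arbitrary illustrative parameters, not the paper's).
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2**40
context.generate_galois_keys()

# The client encrypts its inputs; the server never sees the plaintexts.
x = ts.ckks_vector(context, [1.0, 2.0, 3.0])
y = ts.ckks_vector(context, [4.0, 5.0, 6.0])

# The server computes slot-wise on ciphertexts only.
z = x * y + x

# Only the secret-key holder can decrypt the (approximate) result.
print(z.decrypt())  # ~[5.0, 12.0, 21.0]
```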
Our contributions are as follows. We provide the first end-to-end open-source implementation of non-interactive transformer inference with FHE. We report an amortized time of 9.6 minutes for one input of 128 tokens when evaluating the BERT model on CPU. Our packing methods for encrypted matrices remove the need to repack ciphertexts between encrypted matrix multiplication and activation layers. Additionally, we introduce interleaved batching to eliminate the internal rotations during ciphertext matrix multiplications. Our approach also avoids HE rotations when evaluating softmax and LayerNorm, leading to speedups of 4.22× and 122× over existing works, respectively. Our implementation supports arbitrary token lengths, in contrast to existing solutions that require a full token embedding. Our implementation can be found on GitHub.
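
To make the rotation-free claim concrete, here is a plaintext toy model of the batched-packing idea (a reader's sketch, not the MOAI code): NumPy arrays stand in for SIMD ciphertext slots, and each "ciphertext" holds one coordinate across the whole batch, so a matrix-vector product needs only plaintext-ciphertext multiplications and additions, never slot rotations.

```python
import numpy as np

# Toy model of SIMD ciphertext slots: each "ciphertext" is an array whose
# B slots hold the same coordinate of B different batch inputs.
B, n, m = 4, 3, 2                 # batch size, input dim, output dim
rng = np.random.default_rng(0)
X = rng.standard_normal((B, n))   # B client input vectors (plaintext stand-ins)
W = rng.standard_normal((m, n))   # weight matrix held by the server

# Batched packing: "ciphertext" c[j] carries slots [x_0[j], ..., x_{B-1}[j]].
c = [X[:, j] for j in range(n)]

# Matrix-vector product for every batch element via slot-wise ops only:
# y_i = sum_j W[i, j] * c[j] -- scalar-times-slot-vector and additions.
y = [sum(W[i, j] * c[j] for j in range(n)) for i in range(m)]

# Slot b of y_i equals (X @ W.T)[b, i]: the whole batch is done at once.
assert np.allclose(np.stack(y, axis=1), X @ W.T)
```

In a real CKKS evaluation, each `c[j]` would be a ciphertext and `W[i, j] * c[j]` a plaintext-ciphertext multiplication, which is why no Galois rotations (and hence no rotation keys) are needed for this step.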

ePrint: https://eprint.iacr.org/2025/991

See all topics related to this paper.

Feel free to post resources that are related to this paper below.

Example resources include: implementations, explanation materials, talks, slides, links to previous discussions on other websites.

For more information, see the rules for Resource Topics.