[Resource Topic] 2024/1976: HI-CKKS: Is High-Throughput Neglected? Reimagining CKKS Efficiency with Parallelism

system · December 12, 2024, 10:26am

Welcome to the resource topic for 2024/1976

Title:
HI-CKKS: Is High-Throughput Neglected? Reimagining CKKS Efficiency with Parallelism

Authors: Fuyuan Chen, Jiankuo Dong, Xiaoyu Hu, Zhenjiang Dong, Wangchen Dai, Jingqiang Lin, Fu Xiao

Abstract:

The proliferation of data outsourcing and cloud services has heightened privacy vulnerabilities. CKKS, among the most prominent homomorphic encryption schemes, allows computations on encrypted data, serving as a critical privacy safeguard. However, performance remains a central bottleneck, hindering widespread adoption. Existing optimization efforts often prioritize latency reduction over throughput performance. This paper presents HI-CKKS, a throughput-oriented High-performance Implementation of CKKS homomorphic encryption, addressing these challenges. Our HI-CKKS introduces a batch-supporting asynchronous execution scheme, effectively mitigating frequent data interactions and high waiting delays between hosts and servers in service-oriented scenarios. We analyze the fundamental (I)NTT primitive, which is critical in CKKS, and develop a hierarchical, hybrid high-throughput implementation. This includes efficient arithmetic module instruction set implementations, unified kernel fusion, and hybrid memory optimization strategies that significantly improve memory access efficiency and the performance of (I)NTT operations. Additionally, we propose a multi-dimensional parallel homomorphic multiplication scheme aimed at maximizing throughput and enhancing the performance of (I)NTT and homomorphic multiplication. In conclusion, our implementation is deployed on the RTX 4090, where we conduct a thorough throughput performance evaluation of HI-CKKS, enabling us to pinpoint the most effective parallel parameter settings. Compared to the CPU implementation, our system achieves throughput increases of 175.08\times, 191.27\times, and 679.57\times for NTT, INTT, and HMult, respectively. And our throughput performance still demonstrates a significant improvement, ranging from 1.54\times to 693.17\times compared to the latest GPU-based works.

ePrint: https://eprint.iacr.org/2024/1976

See all topics related to this paper.

Feel free to post resources that are related to this paper below.

Example resources include: implementations, explanation materials, talks, slides, links to previous discussions on other websites.

For more information, see the rules for Resource Topics .