[Resource Topic] 2024/659: Secure Latent Dirichlet Allocation

Welcome to the resource topic for 2024/659

Title:
Secure Latent Dirichlet Allocation

Authors: Thijs Veugen, Vincent Dunning, Michiel Marcus, Bart Kamphorst

Abstract:

Topic modelling refers to a popular set of techniques used to discover hidden topics that occur in a collection of documents. These topics can, for example, be used to categorize documents or label text for further processing. One popular topic modelling technique is Latent Dirichlet Allocation (LDA). In topic modelling scenarios, the documents are often assumed to be in one centralized dataset. However, sometimes documents are held by different parties, and contain privacy- or commercially-sensitive information that cannot be shared.
We present a novel, decentralized approach to train an LDA model securely without having to share any information about the content of the documents with the other parties. We preserve the privacy of the individual parties using a combination of privacy-enhancing technologies.
We show that our decentralized, privacy-preserving LDA solution achieves accuracy similar to that of an (insecure) centralized approach. With 1024-bit Paillier keys, a topic model with 5 topics and 3000 words can be trained in around 16 hours. Furthermore, we show that the solution scales linearly in the total number of words and the number of topics.
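
The abstract mentions Paillier keys but does not describe the protocol itself. As a rough illustration only, the sketch below shows the additive homomorphism that Paillier encryption provides: two parties' per-word counts can be summed while encrypted. This is not the authors' method. The function names, the toy primes, and the example counts are all assumptions made for illustration, and the key size here is far too small to be secure.

```python
import math
import secrets

# Toy Paillier cryptosystem with generator g = n + 1.
# Illustrative only: the primes below are tiny and offer no security.

def keygen(p, q):
    """Return (n, (lam, mu)) for distinct primes p and q."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse, valid because g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt an integer m with 0 <= m < n under public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)

# Hypothetical example: two parties hold counts for a 3-word vocabulary.
n, priv = keygen(1009, 1013)
counts_a = [3, 0, 5]
counts_b = [1, 4, 2]

enc_a = [encrypt(n, m) for m in counts_a]
enc_b = [encrypt(n, m) for m in counts_b]

# The sums are computed on ciphertexts, without seeing either party's counts.
enc_sum = [add_encrypted(n, ca, cb) for ca, cb in zip(enc_a, enc_b)]
totals = [decrypt(n, priv, c) for c in enc_sum]
print(totals)
```

In this sketch a single key holder decrypts the totals. Who holds the key, and which other privacy-enhancing technologies are combined with the encryption, is described in the paper rather than here.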

ePrint: https://eprint.iacr.org/2024/659

See all topics related to this paper.

Feel free to post resources that are related to this paper below.

Example resources include: implementations, explanation materials, talks, slides, links to previous discussions on other websites.

For more information, see the rules for Resource Topics.