Transformers: Efficient one-to-many sequence generation

Type

Master's Thesis

Abstract

Transformers have revolutionized sequence-to-sequence data processing across a wide range of industries. Their ability to handle long-range dependencies and capture contextual information has led to remarkable advancements in speech recognition, image generation, and machine translation. However, current applications primarily focus on one-to-one sequence generation, where a single source sequence is used to produce a single target sequence. In this thesis, we address the challenge of one-to-many sequence generation at the architectural level, where a single source sequence is used to generate multiple target sequences. To expand the capabilities of the transformer model, we introduce an encoder sphere projection strategy, allowing for scalable and efficient architecture-level variation during sequence generation. By generating independent offset vectors with uniform norms and uniform distances from each other, the single source embedding is replicated with a controlled variation added to each copy. This expansion shifts the single encoder-decoder relation to a one-to-many batched decoder, supporting a set of targets processed within the teacher-forcing framework. For the now set-based training, we incorporate a Sinkhorn loss function, which encourages variation among generated output sequences while maintaining similarity to the expected targets. The loss calculation involves a pairwise negative log-likelihood between each predicted output sequence and the ground-truth targets associated with the source. The new architecture supports inherently auto-regressive inference for varied sequence generation, with up to 256 predictions per given source (limited by the model dimension). Compared with multinomial and beam-search sampling on the base model, the expanded model achieved competitive accuracy while reducing both training and inference time. Training time was reduced by 31% on a single NVIDIA K80 (12 GB) and by 27% on a V100 (32 GB).
These advantages diminished once the overhead of using multiple GPUs was introduced. Inference also benefited, with execution time reduced by 9% and 33% compared to multinomial and beam-search sampling, respectively.
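The encoder sphere projection described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the thesis implementation: the function name `sphere_project` and the `radius` parameter are assumptions, and sampling random directions only approximates equal pairwise distances between offsets (a deterministic simplex or orthogonal construction would make them exact).

```python
import numpy as np

def sphere_project(memory, k, radius=0.1, seed=0):
    """Replicate a single encoder output `memory` of shape
    (seq_len, d_model) into k variants by adding offset vectors of
    uniform norm `radius`, sampled from the d_model-dimensional sphere.

    Hypothetical sketch of an encoder sphere projection; names and
    defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    seq_len, d_model = memory.shape
    # Sample k random directions and normalize each to unit length,
    # then scale so every offset has the same norm (the "sphere").
    dirs = rng.standard_normal((k, d_model))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    offsets = radius * dirs                    # (k, d_model), uniform norm
    # Broadcast: one perturbed copy of the encoder memory per target slot,
    # ready to be fed to a batched decoder of k parallel targets.
    return memory[None, :, :] + offsets[:, None, :]  # (k, seq_len, d_model)
```

Each of the `k` perturbed copies would then be decoded in one batch, so a single encoder pass serves all targets, which is consistent with the training- and inference-time reductions reported above.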

Subject / keywords

computer science, transformer, machine learning, sequence generation, NLP, positional encoding, spherical projection, sinkhorn algorithm, drug discovery, HPC
