Generating Molecules in 3D from a Single Sequence
Published
Author
Type
Master's Thesis
Abstract
The de novo generation of three-dimensional molecular structures is a fundamental task in drug discovery, where state-of-the-art approaches often rely on computationally expensive and architecturally complex SE(3)-equivariant models. This thesis explores a simpler, representation-centric paradigm. We introduce a novel method that uses a standard, non-equivariant autoregressive Transformer to generate molecules from a single, unified sequence. This sequence is constructed by interleaving discrete tokens for chemical topology (from SMILES) with discretized tokens for 3D geometry (from internal coordinates), reframing the entire task as a pure language modeling problem. Our primary discrete model, ALT_TOKEN, demonstrates the success of this strategy, achieving 99.0% chemical validity and generating structures with a low median energy of 3.07 kcal/mol that closely match the dataset distribution. These results outperform baselines using continuous representations. In conclusion, this work establishes that a standard Transformer, when paired with a carefully designed discrete and interleaved data representation, provides a viable, efficient, and less complex alternative for high-quality 3D molecular design.
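The interleaving idea described above can be illustrated with a minimal sketch. This is not the thesis's ALT_TOKEN implementation: the value ranges, the bin count, the token spellings such as `<r_51>`, and the helper names `quantize` and `interleave` are all hypothetical choices made for illustration. Each atom token is followed by discretized internal coordinates (bond length, bond angle, dihedral) relative to previously placed atoms, so the whole molecule becomes one flat token sequence that a standard autoregressive model could be trained on.

```python
from typing import List, Optional, Tuple

# Hypothetical value ranges and bin count; the abstract does not specify them.
RANGES = {
    "r": (0.5, 2.5),       # bond length in angstroms
    "a": (0.0, 180.0),     # bond angle in degrees
    "d": (-180.0, 180.0),  # dihedral angle in degrees
}
N_BINS = 100

# (bond length, bond angle, dihedral); None where the coordinate is undefined,
# e.g. the first atom has no internal coordinates at all.
Coords = Tuple[Optional[float], Optional[float], Optional[float]]


def quantize(value: float, lo: float, hi: float, n_bins: int = N_BINS) -> int:
    """Clip value to [lo, hi] and map it to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, lo), hi)
    index = int((clipped - lo) / (hi - lo) * n_bins)
    return min(index, n_bins - 1)


def interleave(atoms: List[Tuple[str, Coords]]) -> List[str]:
    """Emit each topology token followed by its discretized geometry tokens."""
    sequence: List[str] = []
    for symbol, coords in atoms:
        sequence.append(symbol)
        for kind, value in zip(("r", "a", "d"), coords):
            if value is not None:
                lo, hi = RANGES[kind]
                sequence.append(f"<{kind}_{quantize(value, lo, hi)}>")
    return sequence


# Heavy atoms of 1-propanol (SMILES "CCCO") with illustrative geometry values.
propanol = [
    ("C", (None, None, None)),
    ("C", (1.53, None, None)),
    ("C", (1.53, 112.0, None)),
    ("O", (1.43, 109.5, 60.0)),
]

sequence = interleave(propanol)
print(" ".join(sequence))
# C C <r_51> C <r_51> <a_62> O <r_46> <a_60> <d_66>
```

A real tokenizer would also need to pass through non-atom SMILES tokens (bond symbols, ring-closure digits, branch parentheses) and to define which earlier atoms each coordinate refers to; both are omitted here to keep the sketch short.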
Subject/keywords
Molecular generation, Internal coordinates, Language models
