Transformer-Based
Crystal Structure Generation from
OTC and Chemical Composition

Master’s thesis in Physics

ANU PETER


DEPARTMENT OF PHYSICS

CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2026
www.chalmers.se

ii

www.chalmers.se


Master’s thesis 2026

Transformer-Based Crystal Structure Generation
from OTC and Chemical Composition

ANU PETER

Department of Physics
Chalmers University of Technology

Gothenburg, Sweden 2026


Transformer-Based Crystal Structure Generation
from OTC and Chemical Composition
ANU PETER

© ANU PETER 2026.

Supervisor: Henrik Klein Moberg, Department of Physics, Chalmers University of
Technology, Gothenburg, Sweden
Examiner: Anders Hellman, Department of Physics, Chalmers University of Tech-
nology, Gothenburg, Sweden

Master’s Thesis 2026
Department of Physics
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Printed by Chalmers Reproservice
Gothenburg, Sweden 2026

v


Transformer-Based Crystal Structure Generation from OTC and Chemical Compo-
sition
Anu Peter
Department of Mathematical Sciences
Chalmers University of Technology

Abstract
Multi-component oxides, composed of three or more elements, offer a vast combina-
torial space of possible structures with tunable properties such as thermal stability,
ion conductivity, and catalytic activity. Exploring this space using traditional trial-
and-error methods is time-consuming and expensive.
This thesis investigates the use of a Transformer-based language model to generate
Crystallographic Information Files (CIFs), which encode atomic positions, lattice
parameters, and symmetry elements. The model is trained to learn relationships
between structural features and material properties, allowing it to propose new
CIFs representing potential novel crystal structures based on input descriptors like
oxygen transfer capacity and composition.
The results show that the Transformer model can capture complex structural pat-
terns and generate valid CIF sequences, demonstrating its potential as a data-driven
tool to accelerate the discovery and design of multi-component oxides.

vi


Acknowledgements

I would like to sincerely thank Anders Hellman, Professor of Chemical Physics,
Physics, my supervisor, and examiner, for his invaluable guidance, continuous sup-
port, and encouragement throughout this project. His knowledge and feedback were
crucial for my progress and greatly improved the quality of my work.
I am also deeply grateful to Henrik Klein Moberg for his guidance with the code
and model architecture. His experience in AI and technical support was essential
during the implementation phase.
Finally, I thank Rocío Mercado, Assistant Professor of Computer Science and En-
gineering at Chalmers University, for her guidance and for providing me with the
opportunity to carry out this project.

Anu Peter
Gothenburg, January 2026

vii


viii


List of Acronyms

Below is the list of acronyms used throughout this thesis, presented in alphabetical
order:

AI Artificial Intelligence
CIF Crystallographic Information File
CLC Chemical looping combustion
CSD Cambridge Structural Database
CSP Crystal structure prediction
DFT Density Functional Theory
DeCIFer Decoder for Crystallographic Information Files
GPU Graphics Processing Unit
HEO High entropy oxides
MAE Mean Absolute Error
MSE Mean Squared Error
LM Language Model
ML Machine Learning
NLP Natural Language Processing
OC Oxygen Carrier
OTC Oxygen Transfer Capability
PXRD Powder X-ray diffraction
RMSE Root Mean Squared Error

ix


Contents

List of Acronyms viii

List of Figures xiii

List of Tables xv

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Scope and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Theory 3
2.1 Crystal Structures and CIF Files . . . . . . . . . . . . . . . . . . 3
2.2 Chemical Looping and Oxygen Carriers . . . . . . . . . . . . . 4
2.3 Transformer Model: Architecture, Mechanisms, and Relation to CIF

Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.2 Sequence Modeling and the Next-Token Prediction Task . . . 7
2.3.3 Tokenization of Input Sequences . . . . . . . . . . . . . . . . . 8
2.3.4 Embedding and Positional Encoding . . . . . . . . . . . . . . 9
2.3.5 Token Embedding . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.6 Positional Encoding . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.7 Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.8 Multi-Head Attention . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.9 Feed-Forward Layers, Residual Connections, and Layer Nor-

malization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.10 Autoregressive Generation and Masked Attention . . . . . . . 13
2.3.11 Next-Token Prediction Objective . . . . . . . . . . . . . . . . 13
2.3.12 Output Layer and Token Prediction . . . . . . . . . . . . . . . 14
2.3.13 Training Objective: Cross-Entropy Loss . . . . . . . . . . . . . 14
2.3.14 Autoregressive Generation of CIF Files . . . . . . . . . . . . . 14
2.3.15 Summary of the CIF Generation Workflow . . . . . . . . . . . 15
2.3.16 Model Architecture and Hyperparameters . . . . . . . . . . . 16
2.3.17 Vocabulary and Tokenization Challenges in CIF Data . . . . . 16

2.4 Computational Complexity of Transformer Models . . . . . . . . . . . 17
2.5 Overview of deCIFer . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

xi


Contents

3 Methods 19
3.0.1 Dataset Overview and Exploration . . . . . . . . . . . . . . . 19
3.0.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.0.3 Tokenization and Dataset Splitting . . . . . . . . . . . . . . . 21

3.1 Training Setup and Conditional Integration . . . . . . . . . . . . . . 22
3.1.1 Conditional Features and Objective . . . . . . . . . . . . . . . 22
3.1.2 Dual Conditioning Approach . . . . . . . . . . . . . . . . . . . 22

3.1.2.1 Encoder Conditioning . . . . . . . . . . . . . . . . . 22
3.1.2.2 Prefix Conditioning . . . . . . . . . . . . . . . . . . . 22

3.1.3 Precomputed Conditional Embeddings . . . . . . . . . . . . . 23
3.1.4 Batch Size and Block Size . . . . . . . . . . . . . . . . . . . . 23
3.1.5 Transformer Forward Pass . . . . . . . . . . . . . . . . . . . . 23
3.1.6 Prediction and Loss . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.7 Parameter Update . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.8 Training Pipeline Summary . . . . . . . . . . . . . . . . . . . 24
3.1.9 Checkpointing in Training . . . . . . . . . . . . . . . . . . . . 24

3.1.9.1 Purpose of Checkpoints . . . . . . . . . . . . . . . . 24
3.1.9.2 Contents of a Checkpoint . . . . . . . . . . . . . . . 24

3.1.10 CIF Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4 Results 27
4.0.1 Model Architecture Exploration and Selection . . . . . . . . . 27
4.0.2 Model Configuration and Training Setup . . . . . . . . . . . . 27
4.0.3 CIF Generation and Checkpoint Analysis . . . . . . . . . . . . 28
4.0.4 OTC and Energy Range and Distribution Analysis . . . . . . 29
4.0.5 Atomic Position Error Analysis . . . . . . . . . . . . . . . . . 30
4.0.6 Element-Level Generation Behavior . . . . . . . . . . . . . . . 31

5 Conclusion 33

Bibliography 35

xii


List of Figures

2.1 VESTA visualization of magnetite (Fe3O4). Fe atoms are shown in
brown and O atoms in red. . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Schematic of chemical looping combustion. The oxygen carrier alter-
nates between oxidized and reduced states, transferring oxygen from
the air reactor to the fuel reactor [1]. . . . . . . . . . . . . . . . . . . 6

2.3 Transformer architecture with encoder and decoder. The encoder
maps an input sequence into a continuous representation using stacked
layers of multi-head self-attention and feed-forward networks with
residual connections and layer normalization. Positional encodings
preserve the order of tokens. The decoder generates the output se-
quence autoregressively, combining masked self-attention over previ-
ous outputs with cross-attention to the encoder representations, en-
abling the model to learn complex dependencies between input and
output sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.4 Workflow of input preparation for the Transformer: CIF tokens are
mapped to IDs, converted into embeddings, augmented with posi-
tional encodings, and then fed into the Transformer. . . . . . . . . . . 10

2.5 Each token embedding xi is projected into a query (Q), key (K), and
value (V ) vector. Similarity scores are computed via the scaled dot-
product QKT /

√
dk, and a softmax converts these scores into attention

weights. These weights determine how strongly each token attends
to others, and the final output is computed as a weighted sum of the
value vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.6 Illustration of multi-head attention. Input embeddings are projected
into multiple queries, keys, and values, forming independent attention
heads. Each head captures different relationships, and their outputs
are concatenated and linearly projected to produce the final repre-
sentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.7 This diagram illustrates how the model generates CIF files autore-
gressively. At each step, the sequence of previously generated to-
kens (x1, . . . , xt) is fed into the Transformer model, which outputs a
probability distribution over the vocabulary. The next token xt+1 is
sampled from this distribution, allowing for diverse and valid crystal-
lographic sequences. The process repeats until the end-of-file token
is produced. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

xiii


List of Figures

2.8 Schematic overview of deCIFer. The PXRD pattern is embedded and
prepended to the tokenized CIF sequence, which is then processed by
the transformer decoder to generate the CIF autoregressively. . . . . 18

3.1 Combined donut chart and Pearson correlation heatmap showing el-
ement composition and co-occurrence across 4,632 CIFs. . . . . . . . 20

3.2 Combined histogram showing OTC values and energy values across
the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3 Histogram showing the number of tokens per CIF, illustrating se-
quence length variation. . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.1 Percentage of valid CIF files generated at each checkpoint during
training. The success rate shows a steady improvement, exceeding
90% after around 25,000 iterations and reaching a maximum of 99.4%,
indicating increased generation stability over time. . . . . . . . . . . . 29

4.2 Distribution of OTC and energy values for raw test data and gener-
ated CIF files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.3 Scatter plot comparing raw and generated OTC and energy values. . 30
4.4 Mean Absolute Error (MAE) and Mean Squared Error (MSE) of pre-

dicted atomic positions compared to reference structures. The plot
shows that, although deviations exist, the overall error remains within
a reasonable range, indicating physically meaningful structure gener-
ation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.5 Element-wise analysis of structural deviation (OTC-related metric). 32
4.6 Element-wise analysis of energy-related differences between well-generated

and deviated structures, illustrating how generation quality varies de-
pending on chemical composition. . . . . . . . . . . . . . . . . . . . . 32

xiv


List of Tables

2.1 CIF snippet for magnetite (Fe3O4) . . . . . . . . . . . . . . . . . . . 4
2.2 Illustrative next-token prediction probabilities. . . . . . . . . . . . . . 8
2.3 Example of CIF tokenization and embedding mapping . . . . . . . . 8

3.1 Example of CIF tokenization using the customized tokenizer. Each
token is mapped to a numerical ID from a fixed vocabulary. . . . . . 21

xv


List of Tables

xvi


1
Introduction

1.1 Background
Crystal structures determine how materials behave, and the arrangement of atoms
inside a structure influences mechanical strength, electronic properties, thermal sta-
bility, and chemical reactivity. To describe these arrangements, researchers use the
Crystallographic Information File (CIF) format, which stores cell parameters, sym-
metry elements, and atomic positions in a structured and consistent way. CIFs are
widely used in crystallography databases and simulation tools.
As the demand for new functional materials grows, the ability to generate crystal
structures becomes increasingly valuable. Traditional discovery is slow because the
search space is enormous, and both experimental and computational screening are
resource-intensive [1, 2] . Machine learning offers a new direction by learning struc-
tural patterns directly from large datasets of known materials and using them to
propose new candidates.
Transformer-based language models have become powerful tools for sequence gen-
eration. Although originally developed for natural language, they are effective at
learning long-range patterns in any structured sequence. Since CIFs can be viewed
as sequences of tokens representing atoms, symmetries, and coordinates, Transform-
ers are a promising choice for learning how to “write” new crystal structures. This
thesis explores how a Transformer model can understand information stored in CIFs
and generate new structures based on high-level material descriptors, such as oxygen
transfer capacity (OTC) and chemical composition.

1.2 Problem Description
Even with thousands of known structures available, we still lack tools that can
create new crystal structures directly from material properties. Existing methods
often depend on human intuition or computationally heavy simulations. This thesis
investigates a different approach: using a Transformer model to generate a complete
CIF file from OTC, composition, and energy values.
The idea is to train a model that not only understands the formatting of a CIF
but also learns how structural features relate to material properties. If successful,
the model could propose crystal structures that match a desired OTC. However,
generating CIFs is challenging. CIF sequences vary in length, contain many rare
tokens, and require accurate predictions of atomic positions. Precision errors or
sequence drift can easily break structural validity. GPU memory constraints further

1


1. Introduction

limit the model size and batch size during training.
This project explores whether it is possible to overcome these challenges and move
closer to automated property-driven structure generation.

1.3 Research Questions
To understand the potential and limitations of conditional CIF generation, this
thesis focuses on three main questions:

1. Can a Transformer generate a valid CIF structure from OTC and composition?
2. Where does the model fail within the CIF sequence, and why?
3. How do the model’s architecture and hyperparameters—such as sequence length,

tokenization method, and GPU memory limits—affect the accuracy and va-
lidity of the generated CIFs?

These questions guide the evaluation of the model and help reveal what aspects are
most important for improving CIF generation.

1.4 Scope and Limitations
To keep the project focused, several limitations apply:

• The dataset contains roughly 4632 CIFs, which is relatively small for training
deep Transformer models.

• Only a specific group of compositions is included, limiting generalization to
broader material families.

• The study focuses solely on Transformer architectures; diffusion models, graph
networks, or symmetry-aware generative methods are not explored.

• Conditioning information is limited to OTC, composition, and energy. Other
structural or thermodynamic properties are not considered.

These constraints define the boundaries of the study and help maintain a clear focus
on evaluating Transformer models for conditional CIF generation.

2


2
Theory

This chapter presents the fundamental concepts necessary for understanding and
interpreting this study.

2.1 Crystal Structures and CIF Files

The arrangement of atoms in a crystal, known as the crystal structure, fundamen-
tally determines the physical and chemical properties of a material. Thermal sta-
bility, ion conductivity, and catalytic activity are all influenced by how atoms are
positioned within the lattice. Understanding and representing these structures in
a standardized format is crucial for experimental characterization, computational
modeling, and data-driven material design [3].

Crystallographic Information Files (CIFs) provide a standardized format to describe
crystal structures. Each CIF encodes three main components[4][5]:

1. Unit cell parameters: The lengths a, b, c and angles α, β, γ define the repeating
unit of the crystal.

2. Symmetry information: The space group specifies the symmetry operations
present in the structure.

3. Atomic positions: Fractional coordinates indicate the location of each atom
within the unit cell.

A widely studied example is magnetite (Fe3O4), a cubic oxide with a spinel structure.
Table 2.1 shows a simplified CIF snippet for magnetite, highlighting the key features
of cell parameters, symmetry, and atomic positions[6].

3


2. Theory

Keyword / Column Description Example
_cell_length_a Unit cell length along x 8.396 Å
_cell_length_b Unit cell length along y 8.396 Å
_cell_length_c Unit cell length along z 8.396 Å
_cell_angle_alpha Angle 90°
_cell_angle_beta Angle 90°
_cell_angle_gamma Angle 90°
_symmetry_space_group_name_H-M Space group Fd-3m
_atom_sitelabel Atom label Fe1
_atom_sitetypesymbol Atom type Fe
_atom_sitefractx Fractional x coordinate 0.0000
_atom_sitefracty Fractional y coordinate 0.0000
_atom_sitefractz Fractional z coordinate 0.0000
_atom_sitelabel Atom label Fe2
_atom_sitetypesymbol Atom type Fe
_atom_sitefractx Fractional x coordinate 0.6250
_atom_sitefracty Fractional y coordinate 0.6250
_atom_sitefractz Fractional z coordinate 0.6250
_atom_sitelabel Atom label O1
_atom_sitetypesymbol Atom type O
_atom_sitefractx Fractional x coordinate 0.2600
_atom_sitefracty Fractional y coordinate 0.2600
_atom_sitefractz Fractional z coordinate 0.2600

Table 2.1: CIF snippet for magnetite (Fe3O4)

To complement the CIF data, Figure 2.1 shows a VESTA visualization of mag-
netite (Fe3O4). The 3D unit cell representation displays the positions of Fe and O
atoms, illustrating how CIF information translates into actual atomic positions in
the crystal lattice [7].

2.2 Chemical Looping and Oxygen Carriers
Chemical looping combustion (CLC) is an innovative combustion technology in
which a solid oxygen carrier (OC) transfers oxygen from an air reactor to a fuel
reactor. This process allows fuel oxidation to occur without direct contact between
the fuel and atmospheric air, effectively reducing nitrogen oxide (NOx) emissions
and improving overall combustion efficiency [8, 9].
An oxygen carrier is a solid material capable of reversibly incorporating and releasing
oxygen during the CLC cycle[2]. In the air reactor, the OC is oxidized by oxygen
from the air, while in the fuel reactor, it is reduced, transferring oxygen to the
fuel to facilitate combustion. A key performance metric of an oxygen carrier is its
oxygen transfer capacity (OTC), which quantifies the amount of oxygen that can be
delivered per unit mass of the material. Higher OTC values indicate more efficient
oxygen transfer, leading to more complete fuel oxidation and fewer cycles needed
for combustion [10, 11],

4


2. Theory

Figure 2.1: VESTA visualization of magnetite (Fe3O4). Fe atoms are shown in
brown and O atoms in red.

High-entropy oxides (HEOs), which are multi-component oxides consisting of three
or more elements, have shown significant promise as oxygen carriers. Their compo-
sitional complexity increases structural stability at high temperatures and enables
high oxygen transfer capacity, making them ideal candidates for sustainable CLC
processes. [12, 13]

A schematic representation of the CLC process is shown in Figure 2.2, highlighting:

• The flow of oxygen from the air reactor to the fuel reactor.

• The cyclic oxidation and reduction of the oxygen carrier.

• The concept of oxygen transfer capacity (OTC) along the cycle.

5


2. Theory

Figure 2.2: Schematic of chemical looping combustion. The oxygen carrier alter-
nates between oxidized and reduced states, transferring oxygen from the air reactor
to the fuel reactor [1].

In this study, Transformer-based language models are employed to generate Crys-
tallographic Information Files (CIFs) for multi-component oxides. By learning the
relationships between atomic arrangements and material properties such as OTC,
the model can propose novel oxygen carrier candidates efficiently. This approach
offers a data-driven route to accelerate the discovery and design of high-performance
materials for chemical looping applications [2, 14].

2.3 Transformer Model: Architecture, Mechanisms,
and Relation to CIF Generation

2.3.1 Introduction
The Transformer architecture marks a significant evolution in sequence modeling,
replacing recurrent and convolutional mechanisms with a unified attention-based
framework. First introduced by Vaswani et al. (2017) in Attention Is All You
Need, the Transformer demonstrates that sequential dependencies can be captured
solely through self-attention, enabling efficient parallelization and improved learning
of long-range interactions. Its scalability, stability, and expressiveness have since
established it as the foundational model for modern generative systems.[15]
In this thesis, a Transformer is adopted to generate Crystallographic Informa-
tion File (CIF) sequences in an autoregressive manner. Since a CIF file contains
structured crystallographic information—such as lattice parameters, symmetry op-
erations, atom labels, and fractional coordinates—representing it as a token sequence

6


2. Theory

allows the entire generation task to be formulated as a conditional next-token pre-
diction problem, analogous to natural-language modeling.

Figure 2.3: Transformer architecture with encoder and decoder. The encoder maps
an input sequence into a continuous representation using stacked layers of multi-head
self-attention and feed-forward networks with residual connections and layer normal-
ization. Positional encodings preserve the order of tokens. The decoder generates
the output sequence autoregressively, combining masked self-attention over previous
outputs with cross-attention to the encoder representations, enabling the model to
learn complex dependencies between input and output sequences.

2.3.2 Sequence Modeling and the Next-Token Prediction
Task

Before introducing the Transformer architecture, it is important to outline the task
it is designed to solve: predicting the next token in a sequence.
Sequence generation models learn statistical patterns in token order and structure.
Given a sequence of tokens

(x1, x2, . . . , xt),

the model predicts the probability distribution of the next token xt+1:

P (xt+1 | x≤t).

7


2. Theory

This is known as autoregressive modeling, where each new token depends on all
previously generated tokens[16].
Consider the natural-language sequence:

“The cat sat on the . . . ”
From its training data, a model may infer that the word mat commonly completes
this phrase. It then outputs a probability distribution over potential next tokens:

Candidate Token Probability
mat 0.82
floor 0.07
bed 0.04
chair 0.03

Table 2.2: Illustrative next-token prediction probabilities.

The model selects the token with the highest probability to extend the sequence:
“The cat sat on the mat”

The prediction process then continues iteratively until a designated end-of-sequence
marker is reached.
In this thesis, the same autoregressive generation mechanism is applied to CIF se-
quences. Instead of words, the tokens represent crystallographic elements—unit-cell
parameters, symmetry labels, atom identifiers, and fractional coordinates—allowing
the Transformer to construct a CIF file token-by-token.

2.3.3 Tokenization of Input Sequences
Transformers operate on discrete tokens, not raw text. In natural language, to-
kens are typically words or subwords. In crystallography, tokens correspond to CIF
keywords, numbers, atomic symbols, and structural markers.
Each token is assigned an integer ID and mapped to a vector embedding of dimension
dmodel:

xt
token ID−−−−−→ IDt

embedding−−−−−−→ et ∈ Rdmodel .

CIF snippet Token ID Embedding vector (et)
_cell_length_a 1 [0.12, -0.34, 0.56, . . . ]
5.430 2 [-0.21, 0.88, 0.15, . . . ]
_cell_length_b 3 [0.09, -0.44, 0.72, . . . ]
5.430 2 [-0.21, 0.88, 0.15, . . . ]
loop_ 4 [0.05, -0.12, 0.44, . . . ]
Si 5 [0.23, -0.65, 0.77, . . . ]
0.000 6 [-0.11, 0.49, 0.21, . . . ]

Table 2.3: Example of CIF tokenization and embedding mapping

The token IDs represent unique discrete identifiers for each type of CIF component,
while the embedding vectors are learned during training and capture semantic and

8


2. Theory

structural relationships between tokens. Repeated values such as ‘5.430‘ share the
same token ID to reduce the vocabulary size. Special tokens such as ‘loop_‘ enable
the model to detect block structures and repeated patterns in the CIF[17].

2.3.4 Embedding and Positional Encoding

After tokenization, each CIF token is first assigned a unique integer ID. However, the
Transformer model cannot work directly with discrete IDs and requires continuous
vector representations. This is achieved through embeddings, which convert tokens
into dense vectors that capture semantic and structural information.

2.3.5 Token Embedding

Each token xt is mapped to a dmodel-dimensional vector using a learned embedding
matrix E ∈ R|V |×dmodel , where |V | is the size of the token vocabulary:

et = E(xt), t = 1, 2, . . . , T.

These embeddings allow the model to understand relationships between tokens[18].
For example, tokens representing similar fractional coordinates or atomic species
that often appear in the same crystallographic context will have similar vector rep-
resentations.

2.3.6 Positional Encoding

Since Transformers process all tokens in parallel and do not inherently know their
order, positional information must be added explicitly. This is done through sinu-
soidal positional encodings:

PE(pos,2i) = sin
(

pos

100002i/dmodel

)
, PE(pos,2i+1) = cos

(
pos

100002i/dmodel

)
,

where pos indicates the token’s position in the sequence and i indexes the dimension
within the embedding vector. The final input for the Transformer is obtained by
summing the token embedding and its positional encoding:

z(0)
t = et + PEt.

This combination provides the model with both the identity of the token and its
position in the sequence, which is essential for learning structural and sequential
patterns in CIF files, such as the order of keywords, symmetry blocks, and atomic
coordinates[18].

9


2. Theory

Figure 2.4: Workflow of input preparation for the Transformer: CIF tokens are
mapped to IDs, converted into embeddings, augmented with positional encodings,
and then fed into the Transformer.

2.3.7 Attention Mechanism
The core of the Transformer is the attention mechanism, which allows the model
to capture relationships between all tokens in a sequence simultaneously. For CIF
generation, attention enables the model to understand dependencies such as how
unit cell parameters influence each other, how symmetry operations relate to atomic
positions, and how element types correspond to fractional coordinates.
For each token, the model computes three vectors: queries (Q), keys (K), and values
(V ):

Q = XW Q, K = XW K , V = XW V ,

where X is the input representation (embedding + positional encoding), and W Q, W K , W V

are learned projection matrices. The attention between tokens is calculated via the
scaled dot-product:

Attention(Q, K, V ) = softmax
(

QKT

√
dk

)
V,

where dk is the dimension of the key vectors. This computation assigns higher

10


2. Theory

weights to tokens that are more relevant to the current token, allowing the model
to focus on the most important relationships in the sequence.

Figure 2.5: Each token embedding xi is projected into a query (Q), key (K),
and value (V ) vector. Similarity scores are computed via the scaled dot-product
QKT /

√
dk, and a softmax converts these scores into attention weights. These

weights determine how strongly each token attends to others, and the final out-
put is computed as a weighted sum of the value vectors.

2.3.8 Multi-Head Attention
A single attention head captures one type of relationship between tokens, but CIF
sequences contain multiple simultaneous dependencies, including connections be-
tween lattice parameters, atomic positions, element types, and symmetry operations.
Multi-head attention addresses this by computing several attention operations in
parallel.
For each head i, the input embeddings with positional encodings are linearly pro-
jected into queries, keys, and values, and attention is computed independently:

headi = Attention(XW Q
i , XW K

i , XW V
i ),

where W Q
i , W K

i , W V
i are learned matrices for the i-th head. The outputs of all heads

are concatenated and projected to form the final representation:

MultiHead(Q, K, V ) = Concat(head1, . . . , headh)W O.

This allows each head to specialize in capturing different relationships, for example,
one may focus on element-to-coordinate dependencies, another on lattice angles, and
a third on block ordering. Multi-head attention thus enables the model to better
capture the complex patterns inherent in CIF sequences.

11


2. Theory

Figure 2.6: Illustration of multi-head attention. Input embeddings are projected
into multiple queries, keys, and values, forming independent attention heads. Each
head captures different relationships, and their outputs are concatenated and linearly
projected to produce the final representation.

For structured formats like CIFs, these mechanisms are complementary. Self-attention
enables the model to learn relationships between symmetry information, lattice pa-
rameters, and atomic positions, while positional encoding ensures the correct sequen-
tial ordering of fields. Multi-head attention allows simultaneous modeling of multi-
ple structural dependencies. Together, these components allow the Transformer to
generate coherent, valid CIF files in an autoregressive manner.

2.3.9 Feed-Forward Layers, Residual Connections, and Layer
Normalization

After capturing relationships between tokens through the multi-head attention mech-
anism, each Transformer layer includes a position-wise fully connected feed-forward
network (FFN)[18, 19] . This layer introduces nonlinearity and enhances the model’s
ability to process complex interactions. Unlike attention, which mixes informa-
tion across tokens, the FFN operates independently on each token, transforming its
context-aware representation into a higher-level feature space. The FFN is defined
as:

FFN(x) = max(0, xW1 + b1)W2 + b2,

where W1 and W2 are learned weight matrices, b1 and b2 are bias vectors, and
the ReLU activation introduces nonlinearity. In the context of CIF generation,
this layer allows the model to refine information captured from attention, such as

12


2. Theory

combining dependencies between unit cell parameters, symmetry operations, and
atomic coordinates into coherent representations.
To further improve training stability and facilitate learning in deep architectures,
each sub-layer—including multi-head attention and the feed-forward network—is
equipped with a residual connection followed by layer normalization[20]. The resid-
ual connection adds the original input of the sub-layer to its output, which helps
gradients flow during backpropagation and prevents vanishing or exploding gradient
issues[15]. Mathematically, for a sub-layer function Sublayer(·), the output with
residual connection and layer normalization is:

z′
t = LayerNorm(zt + Sublayer(zt)),

where zt is the input token representation and z′
t is the normalized output.

For CIF generation, this mechanism ensures that each token retains its original in-
formation while also incorporating complex patterns learned from the attention and
feed-forward layers. For example, a token representing _atom_site_fract_x main-
tains its initial embedding but also integrates contextual information from lattice
parameters, symmetry operations, and other atomic positions.

2.3.10 Autoregressive Generation and Masked Attention
After processing token representations through multi-head attention and the feed-
forward network, the Transformer decoder generates sequences autoregressively.
During training, given a sequence (x1, . . . , xt), the model learns to predict the next
token xt+1 based on all previous tokens[21] . To ensure the model does not attend
to future positions, a causal mask is applied:

Mask(i, j) =

0, j ≤ i,

−∞, j > i.

This mask is added to the QKT matrix before the softmax computation, preventing
information leakage.

2.3.11 Next-Token Prediction Objective

For a vocabulary V , the model outputs logits ℓt ∈ R|V | for each position. Applying
a softmax converts these into probabilities:

P (xt+1 | x≤t) = softmax(ℓt).

The training objective is to minimize the negative log-likelihood:

L = −
T∑

t=1
log P (xt+1 | x≤t).

13


2. Theory

2.3.12 Output Layer and Token Prediction
After processing through multiple Transformer layers—including multi-head atten-
tion, feed-forward networks, residual connections, and layer normalization—each
token obtains a final context-aware representation. This representation encodes
both the identity of the token and its relationships with all other tokens in the
sequence. The output layer then maps these continuous vectors to a probability
distribution over the vocabulary, allowing the model to predict the next token in
an autoregressive manner. This is performed using a linear projection followed by a
softmax function:

P (xt+1 | x1, . . . , xt) = softmax(z(L)
t W O + bO),

where z(L)
t is the representation of token t from the final Transformer layer, and W O

and bO are the learned projection weights and biases.
In the context of CIF generation, this mechanism allows the model to sequentially
predict crystallographic tokens, including keywords, numerical values, element sym-
bols, and atomic coordinates, by leveraging the context provided by previously gen-
erated tokens.
In summary, the output layer converts the final context-aware token embeddings into
vocabulary probabilities, iteratively selects the next token, and continues this pro-
cess until a complete CIF file is generated. This ensures that the generated sequence
is both structurally coherent and compliant with crystallographic conventions.

2.3.13 Training Objective: Cross-Entropy Loss
The Transformer is trained to generate a CIF file one token at a time. At each step,
the model looks at all previous tokens and tries to predict the next one. This training
setup is called autoregressive learning. For a sequence of tokens (x1, . . . , xT ), the
overall goal is to assign high probability to the correct next token. This idea can be
written as

L =
T −1∑
t=1

log P (xt+1 | x1, . . . , xt).

In practice, the model minimizes the cross-entropy loss, which is the negative log-
likelihood of the correct next token:

CE = −
T −1∑
t=1

log P (xt+1 = x̂t+1),

where x̂t+1 denotes the true next token from the training data.
Cross-entropy measures how different the predicted probability distribution is from
the true token. A lower cross-entropy value means that the model predicts the
next token more accurately. Overall, cross-entropy is a simple, stable, and effective
objective for learning the generative structure of CIF sequences [19].

2.3.14 Autoregressive Generation of CIF Files
Once the model is trained, it generates CIF files one token at a time in an autore-
gressive manner. At each step t, the model looks at all the tokens generated so far

14


2. Theory

(x1, . . . , xt) and predicts a probability distribution over the next token in the vocab-
ulary. Instead of always choosing the most likely token, the next token is randomly
sampled from this distribution using stochastic sampling. This introduces variation
in the generated sequences, allowing the model to produce multiple valid CIFs from
the same starting point.[22, 23]
The generation process continues step by step until a special end-of-file token is
reached. By predicting tokens in sequence, the model preserves the correct order
of crystallographic information, such as listing cell parameters before symmetry op-
erations and atomic coordinates after the atom labels. This approach ensures that
the generated CIFs are complete and consistent, maintaining the correct relation-
ships between lattice parameters, element types, symmetry operators, and atomic
positions[3].

Figure 2.7: This diagram illustrates how the model generates CIF files autoregres-
sively. At each step, the sequence of previously generated tokens (x1, . . . , xt) is fed
into the Transformer model, which outputs a probability distribution over the vo-
cabulary. The next token xt+1 is sampled from this distribution, allowing for diverse
and valid crystallographic sequences. The process repeats until the end-of-file token
is produced.

2.3.15 Summary of the CIF Generation Workflow
The complete CIF generation process integrates tokenization, embedding, positional
encoding, multi-head attention, feed-forward transformation, residual normaliza-
tion, output projection, and autoregressive decoding. Numerical and symbolic CIF
information is converted into tokens, transformed into continuous representations,
and processed across multiple Transformer layers to capture both local and global
crystallographic patterns. The output layer predicts the next token in the sequence,
and the model iteratively constructs a full CIF file using a chosen sampling strategy.
This workflow enables a data-driven, end-to-end approach for generating crystallo-
graphic structures and forms the foundation of the CIFFormer model developed in

15


2. Theory

this thesis.

2.3.16 Model Architecture and Hyperparameters
The CIF generation model is based on a Transformer architecture with stacked
layers that process sequences of atomic positions and features. Each layer consists
of multi-head self-attention and feed-forward networks, which allow the model to
capture both local and long-range dependencies within the crystal structures.
A separate pretrained encoder is used to provide additional input features for the
CIF sequences. This encoder is frozen, meaning its weights are not updated during
training, and its outputs are precomputed to reduce computational overhead. The
pretrained encoder is therefore treated as a fixed feature extractor rather than a
trainable component of the main model.
The key hyperparameters of the Transformer model are summarized using common
notation:

• Number of layers (L): 8 layers in the Transformer used for CIF generation.
• Number of attention heads (H): 8 heads per layer, allowing the model to

focus on multiple aspects of the sequence simultaneously.
• Hidden embedding dimension (dmodel): 512, determining the size of in-

ternal representations for each token.
• Dropout rate (pdrop): 0.1, applied to attention and feed-forward layers to

prevent overfitting.
• Batch size (B): 16 sequences per training step, balancing GPU memory

usage and training stability.
• Maximum sequence length (nseq): 1750 tokens, accommodating CIF se-

quences with many atoms or symmetry operations.
• Learning rate (η): 0.001, with a decay schedule down to 0.00001 for stable

convergence.
• Maximum iterations (Niter): 50,000 training steps.
• Warmup iterations (Nwarm): 100, gradually increasing the learning rate at

the start of training.
• Gradient clipping (gclip): 1.0, to prevent instability from large gradients.
• Conditional input size (dcond): 2 features used to guide CIF generation.

These hyperparameters were selected to balance model performance, training sta-
bility, and the computational constraints of the available GPU resources[24, 19] .

2.3.17 Vocabulary and Tokenization Challenges in CIF Data
Tokenizing CIFs files is more complicated than tokenizing normal text because CIFs
contain numbers, symmetry rules, and nested crystallographic information. The
vocabulary needs to cover both textual elements, like _symmetry_space_group or
element symbols, and numerical values for lattice constants, angles, and atomic
coordinates. Since floating-point numbers cannot be directly represented as tokens,
they are usually split into smaller sub-tokens or discretized, which makes sequences
longer and the model more complex.

16


2. Theory

Additionally, CIF files follow a strict syntax: certain sections must appear in a
specific order, and numerical values must be precise[6] . If tokenization is done
poorly, it can cause information loss, inconsistent number representation, or even
invalid crystal structures during generation. Because of these challenges, designing a
careful tokenization strategy is essential. A good vocabulary helps the Transformer
understand both the structure (grammar) and the numbers in CIF files, enabling it
to generate valid and accurate crystal structures[4, 25].

2.4 Computational Complexity of Transformer Mod-
els

The Transformer architecture is highly effective but comes with significant com-
putational and memory requirements, mainly due to the self-attention mechanism.
Self-attention examines relationships between all pairs of tokens in a sequence. For
a sequence of length n and hidden dimension d, this requires roughly O(n2d) oper-
ations and O(n2) memory to store the attention weights. Here, the notation O(n2)
indicates that the computational cost and memory usage grow approximately with
the square of the sequence length. In other words, if the sequence length doubles, the
number of computations and memory needed roughly quadruples. In CIF sequences,
which can be long due to a large number of atoms or symmetry operations, this can
become a major computational bottleneck during both training and inference.
Other factors, such as batch size and the number of layers, further increase memory
usage, potentially limiting the sequence lengths or batch sizes that can be used
effectively. These constraints may affect the model’s expressiveness and the speed of
convergence during training. Nevertheless, modern GPUs and careful management
of computational resources make it possible to train Transformer-based models for
CIF generation within practical limits.[18]

2.5 Overview of deCIFer
deCIFer is an autoregressive transformer model designed for crystal structure pre-
diction (CSP) from powder X-ray diffraction (PXRD) data. The model generates
Crystallographic Information Files (CIFs), which encode the atomic structure of a
crystal, by conditioning on PXRD patterns. A small neural network first embeds
the PXRD data into a learnable vector, which is prepended to the sequence of CIF
tokens and used to guide the transformer decoder during generation. This allows de-
CIFer to produce crystal structures that are consistent with experimental diffraction
measurements [26]
To efficiently handle CIF sequences of variable length, deCIFer uses a sequence-
packing strategy along with attention masking, ensuring that each structure is pro-
cessed independently while maintaining internal context. During training, PXRD
patterns are augmented with simulated noise and peak broadening to mimic real
experimental conditions, enhancing the model’s robustness. Generated structures
are evaluated based on agreement with reference PXRD patterns and structural

17


2. Theory

validity.
Figure 2.8 illustrates the overall workflow of deCIFer. PXRD data is first embedded
and combined with tokenized CIF sequences, which are then fed into the transformer
decoder. The autoregressive process predicts the CIF tokens sequentially, resulting
in a valid crystal structure aligned with the input PXRD pattern. By integrating ex-
perimental data directly into the generative process, deCIFer bridges computational
CSP and experimental diffraction analysis, providing a powerful tool for materials
characterization and discovery.

Figure 2.8: Schematic overview of deCIFer. The PXRD pattern is embedded and
prepended to the tokenized CIF sequence, which is then processed by the transformer
decoder to generate the CIF autoregressively.

18


3
Methods

This chapter outlines the methodology followed to generate CIF files using a Trans-
former model. It covers the preparation of data, model architecture, training proce-
dures, and evaluation strategies to assess generation quality.

3.0.1 Dataset Overview and Exploration
The dataset used in this study contains 4,632 perovskite structures in CIF format,
representing multicomponent oxides. These structures include 25 different elements
mixed across the A and B sites. Each structure has information on its composition,
energy, and oxygen transfer capacity (OTC), which are important for understanding
stability and functionality.
To better understand the data,the elemental composition and co-occurrence pat-
terns were analyzed using a combined figure showing a donut chart and a Pearson
correlation heatmap. This illustrates which elements are most common and how
frequently they appear together, providing insight into the structural combinations
the model needs to learn (see Figure 3.1).
The OTC and energy distributions were also analyzed across all CIFs. The OTC
histogram shows how oxygen transfer capacity varies among structures, highlighting
those with particularly high or low values. The energy histogram illustrates the
range of structural stability across the dataset (see Figure 3.2).
Finally, the token number distribution was analyzed to understand the variabil-
ity in CIF sequence lengths. Most structures have a moderate number of tokens,
but some very long sequences could challenge the Transformer during training (see
Figure 3.3).
These analyzes provide a clear picture of the dataset and guide decisions regarding
preprocessing, tokenization, and model design for the Transformer.

3.0.2 Preprocessing
The raw CIF files cannot be used directly for training and therefore require pre-
processing. This step prepares the data in a consistent and clean form so that
the Transformer can focus on learning structural patterns rather than addressing
formatting issues.
Each CIF file is first checked to ensure that it represents a valid crystal structure.
Structures with partial atomic occupancy are excluded by default, as incomplete
occupancies introduce uncertainty in atomic positions. Oxygen transfer capacity

19


3. Methods

Figure 3.1: Combined donut chart and Pearson correlation heatmap showing ele-
ment composition and co-occurrence across 4,632 CIFs.

Figure 3.2: Combined histogram showing OTC values and energy values across
the dataset.

Figure 3.3: Histogram showing the number of tokens per CIF, illustrating sequence
length variation.

(OTC) and energy values are extracted and normalized to place all structures on a
comparable scale.
The CIF text is then simplified and standardized. Unnecessary header comments are

20


3. Methods

removed, numerical values are rounded to a fixed precision, and atomic information
is written in a consistent format. Chemical composition, atomic species, and space
group information are also extracted, as they are later used during training and
dataset splitting.
Overall, preprocessing reduces noise in the data, limits extreme variations between
structures, and produces a clean and uniform representation of CIF files.

3.0.3 Tokenization and Dataset Splitting

After preprocessing, each CIF file is converted into a sequence of discrete tokens
using a customized tokenizer designed specifically for crystallographic data. Instead
of relying on character-level or word-level tokenization, the tokenizer uses a fixed
vocabulary that explicitly includes CIF keywords, element symbols, space-group
labels, digits, and punctuation. This design helps preserve the syntactic structure
of CIF files while keeping the representation interpretable.
Each token is mapped to a unique numerical identifier (token ID), which is the
actual input to the Transformer model. Numerical values are not treated as single
tokens; instead, digits and decimal points are tokenized separately. This allows
the model to learn numerical patterns directly from the sequence structure rather
than relying on predefined numeric embeddings. Tokens that do not belong to the
predefined vocabulary are replaced with a special unknown token (<unk>). Space-
group symbols are disambiguated by appending a suffix to avoid confusion with
element names.
Table 3.1 shows a simplified example of how CIF text is tokenized and converted
into token IDs.

Table 3.1: Example of CIF tokenization using the customized tokenizer. Each
token is mapped to a numerical ID from a fixed vocabulary.

CIF Fragment Token Sequence Token IDs
_cell_length_a 7.62 _cell_length_a, 7, ., 6, 2 128, 7, 94, 6, 2
Ba Ti O Ba, Ti, O 64, 29, 7
data_sample data_, sample 211, 56
Pm3m Pm_sg, 3, m 301, 3, 145

Once tokenization is completed, token sequences are padded to a fixed maximum
length to enable batch training. A special padding token is used so that shorter
sequences do not affect model learning.
The tokenized dataset is then randomly divided into three subsets: 80% for training,
10% for validation, and 10% for testing. The training set is used to learn model
parameters, the validation set is used to monitor performance and tune hyperpa-
rameters, and the test set is reserved for final evaluation. This split ensures a fair
assessment of the model’s ability to generalize to unseen crystal structures.

21


3. Methods

3.1 Training Setup and Conditional Integration
After tokenization and dataset splitting, the model is trained on sequences of to-
ken IDs representing crystal structures. Each structure is paired with conditional
features, specifically Oxygen Transfer Capacity (OTC) and energy values. These
features provide physical guidance to the model, helping it generate structures con-
sistent with desired properties.

3.1.1 Conditional Features and Objective
Let X = (x1, x2, . . . , xT ) represent a token sequence of length T for a crystal struc-
ture, and let c = [OTC, Energy] be the conditional vector associated with this
structure.
The model aims to learn the conditional probability distribution:

P (X | c; θ) = P (x1, x2, . . . , xT | c; θ)

Using the chain rule of probability, this can be factorized as:

P (X | c; θ) =
T∏

t=1
P (xt | x1, x2, . . . , xt−1, c; θ)

Here:
• xt is the token at position t.
• θ are the model parameters.
• c ensures that each predicted token considers the conditional features.

3.1.2 Dual Conditioning Approach
The model integrates conditional features using encoder embeddings and prefix
tokens.

3.1.2.1 Encoder Conditioning

Conditional features c are passed through a pretrained encoder network fenc to create
a high-dimensional embedding:

ec = fenc(c) ∈ Rdemb

This embedding is added to each token embedding in the sequence:

x̃t = xt + ec, t = 1, . . . , T

3.1.2.2 Prefix Conditioning

The conditional vector c is also converted into discrete tokens and prepended to the
token sequence:[?, 21]

Xinput = [ctokens, x1, x2, . . . , xT ]

22


3. Methods

This allows the transformer to process conditional information as part of the input
sequence, influencing attention directly.

3.1.3 Precomputed Conditional Embeddings
To improve efficiency, embeddings ec for all unique feature combinations are pre-
computed:

{ec | c ∈ unique OTC-energy pairs}

During training, the model fetches these embeddings rather than recomputing them,
reducing computation and memory cost.

3.1.4 Batch Size and Block Size
The model processes sequences in batches of size B = 32. Each batch is represented
as a tensor:

Xbatch ∈ RB×T ×demb

Here T = 1750 is the block size, representing the maximum number of tokens per
sequence (including prefix). Shorter sequences are padded with a special token so
that all sequences in a batch have the same length.

3.1.5 Transformer Forward Pass

For each layer l = 1, 2, . . . , L, the transformer computes hidden states h
(l)
t for each

token t:

h
(l)
t = TransformerLayer(h(l−1)

1 , . . . , h
(l−1)
T )

where h
(0)
t = x̃t. The final layer outputs h

(L)
t , which contains contextualized repre-

sentations for prediction.

3.1.6 Prediction and Loss
At each token position t, the model predicts a probability distribution over the
vocabulary:

ŷt = softmax(Woh
(L)
t + bo)

The cross-entropy loss is used to measure the difference between predicted prob-
abilities ŷt and true token IDs yt:

L = − 1
BT

B∑
b=1

T∑
t=1

log ŷ
(b)
t [y(b)

t ]

Minimizing L ensures that the predicted sequence matches the true CIF tokens while
respecting conditional features. [19]

23


3. Methods

3.1.7 Parameter Update
Model parameters θ are updated using AdamW optimizer with gradient clipping:[24]

θ ← θ − η
∇θL

∥∇θL∥2 + ϵ

Gradient clipping ensures stable training by preventing exploding gradients.[15]

3.1.8 Training Pipeline Summary
The training pipeline can be summarized as follows:

1. Tokenized CIF sequences and conditional features c are prepared.
2. Conditional embeddings are obtained via encoder and prefix tokens.
3. Transformer processes the sequences and computes hidden states.
4. Probabilities for the next token are predicted.
5. Cross-entropy loss is computed.
6. Model parameters are updated.
7. Validation is performed and checkpoints are saved.

This setup ensures that the model learns the conditional distribution P (X | c) ef-
fectively, enabling the generation of physically meaningful and structurally accurate
crystal sequences.

3.1.9 Checkpointing in Training
Checkpoints are essential for saving and restoring the model’s state during training
and evaluation. They preserve progress and allow resumption, evaluation, or further
analysis.

3.1.9.1 Purpose of Checkpoints

Checkpoints serve several important purposes:
• Resume Training: Allows continuation after interruptions without losing

progress.
• Evaluation: Enables testing on unseen data using saved models.
• Best Model Saving: Keeps the model with the lowest validation loss for

later use.
• Reproducibility: Ensures that results can be replicated for debugging or

further experiments.

3.1.9.2 Contents of a Checkpoint

A typical checkpoint includes:
1. Model State Dictionary: Contains all trainable parameters, including em-

beddings, attention layers, feedforward layers, and normalization weights.
2. Optimizer State: Stores optimizer variables such as learning rate, momen-

tum, and gradient history to allow consistent resumption of training.
3. Training Configuration: Includes model architecture details, hyperparam-

eters, and dataset paths.

24


3. Methods

4. Training Metrics: Tracks the current iteration, best validation loss, and
early stopping counters.

5. Pretrained Encoder State : Includes weights of any pretrained encoder
integrated in the model.

Checkpoints are saved periodically during training:
• After every evaluation interval.
• If validation loss improves, the best-performing model is saved.
• If checkpoint is set to true, checkpoints are saved after every evaluation.

Saving is performed asynchronously to prevent blocking the training loop.
Checkpoints are loaded during evaluation or to resume training. The process in-
cludes:

1. Loading the checkpoint file using torch.load with appropriate device map-
ping.

2. Extracting the state dictionary of the model.
3. Rebuilding the model according to the saved configuration.
4. Loading the state dictionary into the model using a function load_state_dict.
5. Returning the fully restored model for evaluation or training continuation.

Checkpoints are saved as .pth files, which is the standard format for PyTorch models
and supports compatibility across CPU and GPU devices.

3.1.10 CIF Generation
CIF generation refers to the process by which the trained Transformer model pro-
duces new crystal structures in the form of Crystallographic Information Files (CIFs).
Each generated CIF represents a candidate multicomponent perovskite structure,
including lattice parameters, atomic species, and fractional atomic positions.
The model generates CIFs in an autoregressive manner. Starting from an initial
prompt, the model predicts one token at a time, where each prediction depends on
all previously generated tokens. This allows the model to learn and reproduce the
sequential structure of CIF files, including repeated patterns such as cell parameters,
symmetry information, and atomic position blocks.
Generation is guided by conditional information provided to the model. This in-
cludes the chemical composition and, when applicable, target material properties
such as oxygen transfer capacity (OTC) and energy. These conditions help steer the
generation process toward structures with desired chemical and functional charac-
teristics.
Producing valid CIFs is challenging due to their length and the strict syntax of CIF
files. CIF sequences can contain hundreds or thousands of tokens, requiring the
model to maintain long-range consistency. In addition, CIF files follow rigid for-
matting rules, and small token errors can lead to invalid or non-physical structures.
The complexity of multicomponent perovskites further increases the difficulty, as
multiple elements may share lattice sites with varying occupancies.
After generation, the produced CIF files are evaluated to assess their structural va-
lidity and chemical consistency. Valid CIFs provide insight into the model’s ability to
learn crystallographic rules and can be used to explore new perovskite compositions
within a large and complex design space.

25


3. Methods

26


4
Results

This chapter presents the outcomes of CIF generation, model performance, and eval-
uation metrics.
All experiments presented in this chapter were conducted on the Alvis A100 GPU
[27], which provided the computational resources required to train and evaluate the
transformer-based models used in this work. This section presents the results of
model training and architectural comparison and explains the rationale behind the
final model choice used for structure generation and evaluation.

4.0.1 Model Architecture Exploration and Selection
The first set of experiments focused on analyzing the impact of model depth on
training behavior and overall performance. Three transformer architectures with 2,
4, and 8 layers were initially considered. The objective was to investigate whether
increasing model depth leads to improved learning or better generation quality for
crystal structure data.
During training, the 8-layer transformer model could not complete the intended
number of training iterations. The increased depth resulted in substantially higher
GPU memory usage, which exceeded the available memory on the Alvis A100 during
training. Consequently, the training process terminated prematurely. Since this
model could not be trained under the same conditions as the others, it was excluded
from further evaluation.
Both the 2-layer and 4-layer models were successfully trained without memory-
related issues. Their training behavior was analyzed by comparing the loss curves
over training iterations. The loss plots show that all models follow a similar conver-
gence pattern, with no clear improvement in convergence speed or final loss value
when increasing the number of layers.
These observations indicate that increased model depth does not provide a clear
advantage for the given dataset and task. Considering training stability, memory
efficiency, and comparable learning behavior, the 2-layer transformer model was
selected for all subsequent experiments. This model was, therefore, used for CIF
generation, checkpoint-based evaluation, and all further analyzes presented in this
chapter.

4.0.2 Model Configuration and Training Setup
The final model used for CIF generation was a transformer with 2 layers and 2
attention heads. The embedding dimension was set to 512, and the block size was

27


4. Results

1750 tokens. This allowed the model to handle long CIF sequences that contain
structural information, lattice parameters, and atomic positions. The model had
approximately 25.6 million trainable parameters, which were large enough to learn
complex structural patterns while still fitting within the available GPU memory.

Training was performed with a batch size of 16. The AdamW optimizer was used
because it is well-suited for transformer training and provides stable learning. The
initial learning rate was set to 1 × 10−3 and was gradually reduced to a minimum
learning rate of 1× 10−5 as training progressed. A warm-up phase of 100 iterations
was used at the beginning of training to make the optimization process more stable
and to prevent large updates in the early stages.

The optimizer beta values were set to (0.9, 0.98), which control how past gradients
are averaged during training. The model was trained for a maximum of 50,000 itera-
tions. During this process, checkpoints were saved regularly so that CIF generation
and evaluation could be performed at different stages of learning.

This configuration provided a good balance between model size, learning ability, and
memory efficiency, which made it suitable for learning CIF structure patterns.

4.0.3 CIF Generation and Checkpoint Analysis

CIF files were generated from each of the 100 saved checkpoints. During analysis, the
model generally produced valid atomic positions and structural features. However,
after the final atomic position entries in many generated CIF files, invalid characters
occasionally appeared. These trailing regions were ignored during evaluation to
ensure that the analysis reflected only the meaningful structural information.

The generation success rate was tracked across checkpoints to observe how the pro-
portion of valid CIF files evolved during training. In the early stages, the success rate
was relatively low, with the minimum value around 50%. As training progressed,
the model became more stable and a clear improvement was observed. After ap-
proximately 25,000 iterations, the success rate consistently remained above 90%,
with the highest recorded value reaching 99.4%. This trend suggests that the model
gradually learned to generate structurally valid CIF files more reliably as training
continued.

28


4. Results

Figure 4.1: Percentage of valid CIF files generated at each checkpoint during
training. The success rate shows a steady improvement, exceeding 90% after around
25,000 iterations and reaching a maximum of 99.4%, indicating increased generation
stability over time.

4.0.4 OTC and Energy Range and Distribution Analysis
The generated CIF files were analysed with a focus on their OTC and energy values.
The distributions of these properties from the generated structures closely resemble
those from the raw test data. Both the range and the spread of values are well
preserved, indicating that the model learned to generate outputs within physically
meaningful limits and avoided producing extreme or unrealistic values.
Figure 4.2 shows the overall distribution of OTC and energy values, confirming
that the generated data covers a similar range as the ground truth. Figure 4.3
compares generated and raw values in a scatter plot, revealing a surprisingly linear
pattern. Although a more scattered relationship was initially expected, the linear
trend indicates that the model maintains a consistent relationship between OTC
and energy.
Overall, the generated values generally follow the range of the raw data. The ob-
served linear pattern in the scatter plot is noted as a point for further consideration
rather than a definitive conclusion about the model’s behavior.

Figure 4.2: Distribution of OTC and energy values for raw test data and generated
CIF files.

29


4. Results

Figure 4.3: Scatter plot comparing raw and generated OTC and energy values.

4.0.5 Atomic Position Error Analysis

To evaluate how accurately the model predicts atomic positions, the Mean Absolute
Error (MAE) and Mean Squared Error (MSE) were calculated by comparing the
generated structures with the reference structures. These metrics were computed
across all atomic coordinates and then plotted to observe the overall behavior.

The results show noticeable but moderate differences between the predicted and true
atomic positions. This means the model does not reproduce the exact coordinates
perfectly. However, this is expected, as predicting full atomic structures is a highly
complex task involving strict spatial and chemical constraints. Despite the presence
of errors, their magnitude remains within a reasonable range. This indicates that the
generated structures are still physically meaningful and can provide useful guidance
for materials design and structural exploration, even if they are not precise enough
for exact structural determination.

30


4. Results

Figure 4.4: Mean Absolute Error (MAE) and Mean Squared Error (MSE) of pre-
dicted atomic positions compared to reference structures. The plot shows that,
although deviations exist, the overall error remains within a reasonable range, indi-
cating physically meaningful structure generation.

4.0.6 Element-Level Generation Behavior
To better understand the chemical factors influencing generation quality, the pres-
ence of individual elements was analyzed in both well-generated and poorly-generated
CIF structures. The goal was to determine whether certain elements are more fre-
quently associated with high-quality generations or with structural deviations. This
was quantified using a preference ratio, which measures how often an element ap-
pears in high-quality structures relative to deviated ones.
The analysis shows that some elements are strongly associated with well-generated
structures. These include yttrium (Y), lanthanum (La), samarium (Sm), strontium
(Sr), barium (Ba), and cobalt (Co). For example, yttrium (Y) appears in good
generations about 94% of the time, lanthanum (La) about 91%, and samarium
(Sm) about 88%. This suggests that the model handles structures containing these
elements more reliably. A possible reason is that these elements occur in more
regular or well-represented structural environments in the training data.
In contrast, some elements are more frequently linked to deviations. Zirconium
(Zr) appears in poorly-generated structures in about 85% of its occurrences, while
chromium (Cr) and molybdenum (Mo) are found in deviated generations more than
83% of the time. Tin (Sn) shows a similar trend. This may indicate that structures
containing these elements are more complex, less common in the dataset, or involve
coordination environments that are more difficult for the model to learn.
Other elements, including iron (Fe), nickel (Ni), calcium (Ca), titanium (Ti), tung-
sten (W), silver (Ag), potassium (K), and thallium (Tl), show a more balanced
behavior. They appear in both well-generated and deviated structures at similar
rates, suggesting moderate and stable model performance for these chemical envi-
ronments.

31


4. Results

The element-wise trends are visualized in Figures 4.5 and 4.6, which show how
structural deviation and energy-related differences vary across elements. These plots
highlight that generation quality is not only a modeling issue but also closely linked
to the underlying chemistry of the elements involved.

Figure 4.5: Element-wise analysis of structural deviation (OTC-related metric).

Figure 4.6: Element-wise analysis of energy-related differences between well-
generated and deviated structures, illustrating how generation quality varies de-
pending on chemical composition.

32


5
Conclusion

This work demonstrates that the Transformer model can successfully generate CIF
files, highlighting its potential as a tool for materials design. The model is capable
of learning structural patterns and producing crystal structures that are chemically
reasonable rather than random, which is promising for complex systems like high-
entropy oxides.
However, further analysis is necessary to fully assess the quality and correctness
of the generated structures. While the outputs generally follow meaningful pat-
terns, it remains important to determine whether the model is truly generating
novel structures or primarily reproducing patterns observed in the training data.
Understanding this distinction between learning and memorization is a key area for
future work.
The results also indicate that the training data contains many similar structures.
This limits the diversity of the generated outputs, suggesting that expanding the
dataset to include more varied compositions could help the model explore a wider
chemical space and improve the novelty of generated structures.
In addition, alternative sampling techniques could be explored during generation to
enhance diversity while maintaining realistic and stable structures. Adjustments in
the way outputs are selected may allow the model to produce a broader range of
chemically valid materials.
Overall, the study shows that the Transformer-based approach is promising for gen-
erating crystal structures, but additional experimentation, validation, and optimiza-
tion are required before it can be confidently applied to the discovery of new high-
entropy oxide materials.

33


5. Conclusion

34


Bibliography

[1] J. Hildingsson, “Material exploration through active learning: A method to
explore compositional space and find oxygen carriers for chemical looping ap-
plications, 2024.

[2] Y. De Vos et al., “Development of stable oxygen carrier materials for chemical
looping processes—A review,” Catalysts, vol. 10, no. 8, 926, 2020.

[3] A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia,
D. Gunter, D. Skinner, G. Ceder, and K. A. Persson, “The Materials Project:
A materials genome approach to accelerating materials innovation,” APL Ma-
terials, vol. 1, no. 1, 011002, 2013.

[4] S. R. Hall, F. H. Allen, and I. D. Brown, “The crystallographic information file
(CIF): a new standard archive file for crystallography,” Acta Crystallographica
Section A, vol. 47, pp. 655–685, 1991.

[5] I. D. Brown and B. McMahon, “The Crystallographic Information File (CIF),”
Data Science Journal, vol. 5, pp. 174–177, 2006.

[6] International Union of Crystallography (IUCr), “CIF standard specification,”
IUCr Resources, 1991.

[7] K. Momma and F. Izumi, “VESTA 3 for three-dimensional visualization of
crystal, volumetric and morphology data,” Journal of Applied Crystallography,
vol. 44, pp. 1272–1276, 2011.

[8] A. Lyngfelt, Chemical-looping combustion of solid fuels – Status of development.
Fuel, 83, pp. 1459–1473, 2004.

[9] A. Lyngfelt, “Chemical looping combustion,” in Greenhouse Gas Issues.
[10] J. Brorsson, H. K. Moberg, J. Hildingsson, J. Gastaldi, T. Mattisson, and A.

Hellman, Data-Efficient Design of High-Entropy Oxygen Carriers for Chemical
Looping Using Active Learning, ACS Materials Au, 2026.

[11] J. Brorsson, H. K. Moberg, J. Hildingsson, J. Gastaldi, T. Mattisson,
and A. Hellman, “Material exploration through active learning – METAL,”
arXiv:2601.03933, 2026.

[12] I. Adánez-Rubio et al., “Use of a high-entropy oxide as an oxygen carrier for
chemical looping,” Energy, vol. 298, 131307, 2024.

[13] A. Sarkar et al., “High entropy oxides for reversible energy storage,” Chemistry
literature, 2018.

[14] C. Riley et al., “A High Entropy Oxide Designed to Catalyze CO Oxidation.”
[15] R. Pascanu, T. Mikolov, and Y. Bengio, On the Difficulty of Training Recurrent

Neural Networks, ICML, 2013.
[16] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language

Models are Unsupervised Multitask Learners,” OpenAI Technical Report, 2019.

35


Bibliography

[17] R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare
Words with Subword Units,” ACL, 2016.

[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin, Attention Is All You Need, Advances in Neural
Information Processing Systems, 2017.

[19] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[20] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,”

arXiv:1607.06450, 2016.
[21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of

Deep Bidirectional Transformers for Language Understanding, NAACL, 2019.
[22] T. Shen, “Chemical looping combustion performance of high-entropy oxide oxy-

gen carriers,” Journal Article, 2025.
[23] J. Yang, “A comprehensive review of high entropy oxides: unique properties

and applications,” SciOpen, 2025.
[24] I. Loshchilov and F. Hutter, Decoupled Weight Decay Regularization, ICLR,

2019.
[25] S. Gražulis et al., “Crystallography Open Database (COD): An open-access

collection of crystal structures,” Nucleic Acids Research, 2012.
[26] F. L. Johansen et al., “deCIFer: Crystal Structure Prediction from Powder

Diffraction Data using Autoregressive Language Models,” arXiv:2502.02189,
2025.

[27] Chalmers e-Commons, C3SE, & NAISS, Alvis: National AI/ML GPU Cluster
with NVIDIA A100 GPUs, NAISS Resource Information, 2026. Available at:
https://www.naiss.se/resource/alvis/

36

https://www.naiss.se/resource/alvis/


DEPARTMENT OF SOME SUBJECT OR TECHNOLOGY
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden
www.chalmers.se

www.chalmers.se

	List of Acronyms
	List of Figures
	List of Tables
	Introduction
	Background
	Problem Description
	Research Questions
	Scope and Limitations

	Theory
	Crystal Structures and CIF Files
	Chemical Looping and Oxygen Carriers
	Transformer Model: Architecture, Mechanisms, and Relation to CIF Generation
	Introduction
	Sequence Modeling and the Next-Token Prediction Task
	Tokenization of Input Sequences
	Embedding and Positional Encoding
	Token Embedding
	Positional Encoding
	Attention Mechanism
	Multi-Head Attention
	Feed-Forward Layers, Residual Connections, and Layer Normalization
	Autoregressive Generation and Masked Attention
	Next-Token Prediction Objective
	Output Layer and Token Prediction
	Training Objective: Cross-Entropy Loss
	Autoregressive Generation of CIF Files
	Summary of the CIF Generation Workflow
	Model Architecture and Hyperparameters
	Vocabulary and Tokenization Challenges in CIF Data

	Computational Complexity of Transformer Models
	Overview of deCIFer

	Methods
	Dataset Overview and Exploration
	Preprocessing 
	Tokenization and Dataset Splitting

	Training Setup and Conditional Integration
	Conditional Features and Objective
	Dual Conditioning Approach
	Encoder Conditioning
	Prefix Conditioning

	Precomputed Conditional Embeddings
	Batch Size and Block Size
	Transformer Forward Pass
	Prediction and Loss
	Parameter Update
	Training Pipeline Summary
	Checkpointing in Training
	Purpose of Checkpoints
	Contents of a Checkpoint

	CIF Generation


	Results
	Model Architecture Exploration and Selection
	Model Configuration and Training Setup
	CIF Generation and Checkpoint Analysis
	OTC and Energy Range and Distribution Analysis
	Atomic Position Error Analysis
	Element-Level Generation Behavior


	Conclusion
	Bibliography