High-throughput Low-power BCH Decoder for Short-reach Optical Communication System
Master's thesis in Embedded Electronic System Design
XU WANG
© XU WANG, 2023.

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg, Sweden
Telephone +46 31 772 1000
Gothenburg, Sweden 2023

Chalmers advisor: Lars Svensson, Computer Science and Engineering
Company advisor: Christoffer Fougstedt, Ericsson
Examiner: Per Larsson-Edefors, Computer Science and Engineering

Cover: Unrolled pipeline riBM BCH decoder
Typeset in LaTeX

Abstract
Short-reach optical systems in data centers or high-performance computing systems are often subjected to harsh environmental conditions, especially high temperatures. These elevated temperatures can lead to various challenges, including high bit-error rates (BER) and lower data rates. To mitigate these problems, forward error correction (FEC) can be employed. A suitable FEC can also help to reduce the total system power by relaxing the optical modulation amplitude (OMA) requirement at the receiver. FEC consists of two functional blocks, an encoder and a decoder. Compared with the encoder, the decoder significantly influences the area usage and power consumption of the FEC.
Among FEC alternatives, the Bose–Chaudhuri–Hocquenghem (BCH) code is favorable regarding hardware complexity and error correction ability. In this work, we propose a BCH decoder circuit with error correction ability t = 3 and 4, and implement its corresponding encoder in order to evaluate the overall system power-dissipation reduction. We also propose an estimation model based on MATLAB to predict the performance of different system architectures and key equation solving algorithms. Based on the estimation result, a reformulated inversionless Berlekamp-Massey (riBM) algorithm is selected for the key equation solver, and the unrolled pipeline (UP) system architecture is implemented to increase the throughput and reduce the power consumption. A system based on the iterative parallel (IP) architecture is also implemented as a comparison architecture. Compared with the IP system, the UP system can save up to 20% power at an input BER of 10^−3 in a 22-nm CMOS process technology. When the power consumption of the encoder and an actual-case transmitter is included, the BCH FEC can reduce the total power by up to 58% compared with the uncoded system for a 190-mW transmitter.

Keywords: optical fiber system, forward error correction, low power, high throughput, Bose–Chaudhuri–Hocquenghem (BCH), very large-scale integration (VLSI).

Acknowledgements
I am grateful to my parents for their unwavering support in enabling me to pursue my master's degree. I would also like to thank my Chalmers supervisor, Prof. Lars Svensson, who gave me this thesis opportunity and provided insightful supervision. I would like to acknowledge the invaluable assistance and mentorship provided by Dr. Christoffer Fougstedt from Ericsson, who has been instrumental in helping me navigate and overcome numerous challenges, and whose shared experience was crucial in adding depth and richness to this thesis work. I would like to extend my sincere appreciation to my examiner, Prof.
Per Larsson-Edefors, for his constructive suggestions and meticulous examination.
Xu Wang, Gothenburg, June 2023

Contents
1 Introduction
  1.1 Purposes and Goals
  1.2 Thesis Outline
2 Technical Background
  2.1 Fiber Communication System
  2.2 Galois Field Arithmetic
    2.2.1 Group
    2.2.2 Field
    2.2.3 Polynomial
    2.2.4 Addition
    2.2.5 Multiplication
  2.3 Forward Error Correction
    2.3.1 Hamming Code
    2.3.2 Reed-Solomon Code
    2.3.3 BCH code
  2.4 Encoder
  2.5 Decoder
    2.5.1 Syndrome-Table Decoding
    2.5.2 Algebraic decoding
      2.5.2.1 Berlekamp-Massey Algorithm
      2.5.2.2 Peterson Algorithm
    2.5.3 Chien Search
3 Methods
  3.1 Bit-operation FEC Simulator
  3.2 Estimation Model
  3.3 VHDL Code Generator
  3.4 Auto-synthesis System
  3.5 BSC Simulator
4 Design
  4.1 Power Evaluation System
  4.2 Galois Field Multiplier
  4.3 Berlekamp-Massey Decoder
    4.3.1 Syndrome Calculation
    4.3.2 Key Equation Solver
    4.3.3 riBM Structure
    4.3.4 Parallel Chien Search
  4.4 System Architecture
    4.4.1 Direct Iteration
    4.4.2 Parallel Iteration
    4.4.3 Unrolled Pipeline
    4.4.4 State of the Art
5 Results
  5.1 GF Multiplier
  5.2 Syndrome Component
  5.3 Key Equation Solver
    5.3.1 t = 3 riBM Component
    5.3.2 t = 4 riBM Component
  5.4 Chien Search
  5.5 BSC Simulation
  5.6 Post-FEC Power Saving
6 Conclusion
Bibliography

Acronyms
BCH Bose-Chaudhuri-Hocquenghem.
BER bit error rate.
BM Berlekamp-Massey.
BSC Binary symmetric channel.
CMOS complementary metal–oxide–semiconductor.
CRT Chinese remainder theorem.
CS-RiBM compensated simplified RiBM.
DI Direct iteration.
FDSOI Fully depleted silicon on insulator.
FEC Forward error correction.
FIFO First-In-First-Out.
GF Galois Field.
HD Hard-Decision.
iBM Inversionless BM.
LCM Least common multiple.
LDPC Low-Density Parity-Check.
LFSR Linear feedback shift register.
LUT Look-Up table.
MUX Multiplexer.
OMA optical modulation amplitude.
PC Personal computer.
PE0 Processor engine zero.
PE1 Processor engine one.
PI Parallel iteration.
riBM Reformulated inversionless BM.
RS Reed-Solomon.
RTL Register-transfer level.
SD Soft-Decision.
TiBM Truncated iBM.
UP Unrolled pipeline.
VCD Value change dump.
VCSEL Vertical-cavity surface-emitting laser.
VHDL Very high-speed integrated circuit hardware description language.
XOR Exclusive OR.

1 Introduction
Short-reach interconnections are commonly used in network applications, such as data centers, and in connecting equipment that needs high-bandwidth links, such as front-haul links in communication networks. Vertical-cavity surface-emitting lasers (VCSELs) are a popular choice for such systems [1] due to their ability to be directly modulated at high speeds, which enables compact transceiver modules with small footprints and energy- and cost-efficient interconnects. Even with advanced cooling techniques, the operating environment for the VCSELs can reach temperatures above 70 °C [2], and even higher in automotive systems (up to 125 °C). In addition, data centers and high-performance computers often necessitate supercomputing capabilities, which in turn rely on high bandwidth. Hence, those systems require not only high throughput but also reliability at high temperatures. Forward error correction (FEC) is a widely used family of error correction techniques for detecting and correcting errors during transmission. The FEC adds redundant bits to the original data, which allows the receiver to detect and correct errors without requesting the transmitter to resend the data.
FEC is commonly used in memory systems [3, 4, 5] to correct the random errors and burst errors that may occur during data reading and writing, improving data storage reliability. Moreover, in communication links, FEC can help to decrease the random errors added by the transmission medium, and push the data rate further without negatively impacting the bit error rate (BER) [6]. Adding FEC to a fiber-optic communication system may introduce hardware overheads, such as more area usage and power consumption; however, a suitable FEC system may reduce the overall system power, offering a power saving for the total system [7]. Incorporating FEC can relax the optical modulation amplitude (OMA) requirement at the receiver while maintaining the same BER. Relaxing this requirement at the receiver reduces the power consumption at the transmitter, resulting in a power-efficient operating point. Szczerba et al. [7] developed an FEC based on a Hamming code in a 65-nm complementary metal–oxide–semiconductor (CMOS) process, resulting in a 23% maximum power reduction. Fougstedt et al. [8] demonstrated a low-power Bose-Chaudhuri-Hocquenghem (BCH) FEC in a 28-nm CMOS process technology, which offered a 25% total power saving with an error correction ability of two. In a harsh environment, such as high temperature, an error correction ability of two or below may be insufficient, thus an FEC with a higher error correction ability is required.

Regarding latency, the BCH code is a suitable candidate with moderate latency, and it also offers lower complexity compared with the symbol-based Reed-Solomon (RS) codes, which can correct multiple errors but at the cost of higher latency. Hamming codes are indeed the low-latency option, but they can only correct one error, which is insufficient for meeting high error correction requirements. Therefore, BCH codes stand out as a suitable option for applications where both low latency and high correction ability are critical requirements.
1.1 Purposes and Goals
The purpose of this thesis work is to improve the reliability and throughput of short-reach optical communication systems operating in high-temperature environments by incorporating low-power FECs. By integrating low-power FECs, the systems can operate at a power-efficient point and mitigate transmission errors, ensuring reliability in high-temperature conditions. The final goal is to implement a BCH decoder based on the Berlekamp-Massey (BM) algorithm with error correction abilities of three and four, with low latency and a throughput of 56 Gbit/s. The final goal can be divided into five sub-goals:
• Setting up a model comparing the uncoded BER (without FEC) and the post-FEC BER (with FEC) to evaluate the power relaxation offered by the FEC.
• Implementing an efficient Galois Field (GF) multiplier component considering the area usage and timing, and evaluating the performance for different primitive polynomials.
• Implementing a suitable key equation solver algorithm with error correction abilities of three and four in a 22-nm CMOS technology, considering the timing performance, hardware usage, and power efficiency.
• Evaluating the performance considering the power relaxation for the total system, latency, and throughput.
• Appraising and comparing the performance of different error-correction abilities and methods.

1.2 Thesis Outline
In this thesis report, Chapter 2 provides technical background to the reader, including introductions to a simple fiber system, GF arithmetic, common FEC codes, and an FEC system based on binary BCH codes. In the following Chapter 3, the design methods are introduced, from the MATLAB models to the very high-speed integrated circuit hardware description language (VHDL) code. Then, Chapter 4 finalizes the design of this thesis project by comparing different GF multipliers, different key-solver topologies, and system architectures.
Chapter 5 demonstrates the synthesis and implementation results, from each individual component to the different decoder structures. In the end, Chapter 6 summarizes the results.

2 Technical Background
This chapter introduces the technical background of the fiber communication system, GF arithmetic, common FEC codes, and an FEC system based on binary BCH codes.

2.1 Fiber Communication System
A fiber communication system transmits information from the transmitter to the receiver through an optical fiber. The message is carried by infrared light generated by a laser, typically a VCSEL, which offers high speed and improved power and spectral properties at a low cost. A simple fiber system is shown in figure 2.1. Figure 2.1: A simplified fiber system

The transmitter converts electrical signals into an optical format and transmits them over fiber optic cables. The specific components present in the transmitter are determined by the system's requirements. A typical transmitter includes a data generator, an FEC encoder, a driver, and a laser. The data generator produces the transmitted messages in an electrical format, while the driver converts the voltage to current to drive the laser. The laser then emits infrared light based on the current input. FEC is commonly utilized in long-haul transmissions but is gaining increasing interest in short-haul systems, given its capacity to enhance system reliability [9] and reduce power consumption [7]. A modulator is often added to the transmitter depending on the throughput requirement.

The receiver in turn converts the optical signals back to electrical signals. It may also need to correct errors introduced by the channel. Just like the transmitter, the complexity of the receiver can vary, but generally, it comprises a photo-diode, an amplifier, an FEC decoder, and a digital receiver. Based on the modulation format of the transmitted message, a demodulation component may also be required to demodulate the message.
The photo-diode receives the optical signal and transforms it into an electrical current, which is then amplified by the following amplifier. The FEC decoder then corrects any errors in the message, and the data from the decoder is sent to a receiving terminal, such as a personal computer (PC) interface.

2.2 Galois Field Arithmetic
A Galois field or finite field is a field that contains a finite number of elements [10]. A Galois field (GF) contains a prime number p or a prime power p^m of elements, and the GF with p^m elements is referred to as GF(p^m). GF(2) is the simplest finite field, and it plays an important role in communication and storage systems. There are also addition, subtraction, multiplication, and division operations in this field, but they follow different rules compared with the real field. To better understand GF arithmetic, some important definitions and representations are provided first, and then the addition and multiplication in this field are explained.

2.2.1 Group
A group is a set of elements G equipped with an operation · that satisfies the following properties:
• Closure: for any a ∈ G, b ∈ G, the result of a · b also belongs to G.
• Associativity: for any a, b, c ∈ G, (a · b) · c = a · (b · c).
• Identity: there is an identity element e ∈ G such that for all a ∈ G, a · e = e · a = a.
• Inverse: for any a ∈ G, there is an inverse a^−1 ∈ G such that a · a^−1 = e.
If the operation · is commutative, then the group is commutative or abelian.

2.2.2 Field
If a commutative group is equipped with two operations, an additive operation + and a multiplicative operation ·, such that:
• Distributive law: for all a, b, c, (a + b) · c = (a · c) + (b · c),
then this structure is a field F.

2.2.3 Polynomial
The elements in GF can be represented in two ways: the power basis and the standard basis (polynomial basis). The standard basis representation is used in this work and is constructed from a primitive polynomial.
Below are the definitions needed to understand the primitive polynomial:
• Primitive polynomial: a primitive polynomial is a minimal polynomial associated with a primitive element of the GF.
• Minimal polynomial: a minimal polynomial is the lowest-degree irreducible polynomial that, when evaluated at the primitive element, results in zero.
• Irreducible polynomial: if a polynomial over GF cannot be factored into lower-degree polynomials, then it is irreducible.

2.2.4 Addition
Addition and subtraction are straightforward. These operations are performed by first adding or subtracting the two polynomials, if the element representation is polynomial, and then taking the result modulo the prime number p. Assume the two addends a and b have the polynomials a(α) = a_0 + a_1·α + ... + a_{m−1}·α^{m−1} and b(α) = b_0 + b_1·α + ... + b_{m−1}·α^{m−1}, respectively. Then the addition can be performed by:

c(α) = (a(α) + b(α)) mod p, (2.1)

When p is two, the addition and subtraction can be simplified to exclusive OR (XOR) gates, and formula 2.1 reduces to:

c(α) = a(α) ⊕ b(α), (2.2)

where ⊕ denotes the XOR operation.

2.2.5 Multiplication
Multiplication in GF is relatively complex. Multiplication in a finite field is a multiplication followed by a modulo operation with the primitive polynomial p(x) used to define the field. Assume the same operands a and b as for the addition, and a primitive polynomial p(x) = p_0 + p_1·x + ... + p_m·x^m. The multiplication is denoted as:

c(α) = (a(α) · b(α)) mod p(x), (2.3)

which can be rewritten as:

c(α) = b_0·[a(α) mod p(x)] + b_1·[a(α)·α mod p(x)] + ... + b_{m−1}·[a(α)·α^{m−1} mod p(x)]. (2.4)

Many architectures [11, 12] have been proposed based on formula 2.4 to improve the hardware efficiency. Mastrovito multipliers were proposed in [13] to reformulate formula 2.4, aiming at improving the throughput, and different optimized Mastrovito multipliers were proposed [14, 15] to improve the performance.
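The modular arithmetic above is compact in software as well. The following is a minimal sketch of equations 2.2 and 2.3, assuming m = 4 and the primitive polynomial p(x) = x^4 + x + 1 purely as an example (the thesis itself evaluates several primitive polynomials):

```python
# Sketch of GF(2^m) arithmetic: addition (equation 2.2) is a bitwise XOR,
# multiplication (equation 2.3) is a carry-less multiply reduced modulo p(x).
# m = 4 and p(x) = x^4 + x + 1 (0b10011) are illustrative choices only.

M = 4
PRIM = 0b10011  # p(x) = x^4 + x + 1

def gf_add(a: int, b: int) -> int:
    """Addition in GF(2^m): coefficient-wise XOR (equation 2.2)."""
    return a ^ b

def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication with reduction by p(x) (equation 2.3)."""
    prod = 0
    while b:
        if b & 1:
            prod ^= a          # add a * (current power of x)
        b >>= 1
        a <<= 1                # a := a * x
        if a & (1 << M):       # degree reached m: subtract (XOR) p(x)
            a ^= PRIM
    return prod

# alpha = x = 0b0010 is a primitive element: its powers cycle through
# all 2^m - 1 nonzero field elements.
alpha = 0b0010
powers = set()
x = 1
for _ in range(2**M - 1):
    powers.add(x)
    x = gf_mul(x, alpha)
print(len(powers))  # 15 distinct nonzero elements
```

The reduction step mirrors the hardware view of equation 2.4: each shift of a by α is followed by a conditional XOR with p(x), which is exactly the feedback of an LFSR-style multiplier.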
2.3 Forward Error Correction
FEC is a type of error correction method and is commonly used in communication systems and memory systems to increase the reliability and efficiency of the systems. FEC typically comprises an encoder and a decoder. The encoder appends redundant bits calculated by mathematical algorithms, facilitating the decoder in decoding the message. According to the error patterns, the decoder detects and corrects the errors in the contaminated message. There are different types of FEC codes, and their decoding algorithms can be divided into Hard-Decision (HD) and Soft-Decision (SD) [16]. HD is typically more power efficient and less complex compared with SD, since HD quantizes the message before decoding it [6]. However, HD offers a lower coding gain (the improvement in signal-to-noise ratio achieved by FEC over an uncoded system while maintaining the same BER) compared with its counterpart. Among the HD codes, Hamming, BCH, and RS are conventional codes. Although there are more advanced codes, such as Low-Density Parity-Check (LDPC) codes and concatenated BCH codes, which aim at higher coding gain and throughput, only the simple algebraic codes Hamming, BCH, and RS are considered in this thesis.

2.3.1 Hamming Code
The Hamming code is a type of linear error-correcting code, and it can only correct a single bit error per block [17]. The Hamming code is in fact a special-case BCH code with a correction ability of one, t = 1. The Hamming code is the most lightweight code regarding complexity, since a simple Look-Up table (LUT) can be used to decode the received codes. A Hamming code can be denoted as (n, k), where n is the length of the codeword and k is the length of the message, with n − k = m · t, n = 2^m − 1, and t the error correction ability. There are different Hamming codes regarding the codeword length; a simple (7, 4) code appends three parity-check bits to a four-bit message to form a seven-bit codeword.
While Hamming codes are the most lightweight, their error correction ability is limited to one. Therefore, for systems with higher error correction requirements, BCH codes and RS codes are used.

2.3.2 Reed-Solomon Code
Reed-Solomon (RS) codes are also linear error-correcting codes and are commonly used in applications where high levels of error correction are required. Different from the BCH code and the Hamming code, the RS code is based on symbols [18] rather than bits. An RS code can be denoted as RS(n, k, t), where 2t = n − k and n = 2^m − 1, and it can correct t symbol errors. To be more specific, an RS code with error correction ability t can correct at least t bit errors and at most mt bit errors. Because of the symbol operation, RS codes are efficient at correcting burst errors, but the encoding and decoding of RS codes are more complex than those of BCH and Hamming codes. RS codes are widely used in long-haul communication, and RS(544, 514, 15) and RS(528, 514, 7) are recommended in the current Ethernet standard [19].

2.3.3 BCH code
Bose-Chaudhuri-Hocquenghem (BCH) codes are a subclass of cyclic error-correcting codes and are constructed over a Galois field [10]. Binary BCH codes are commonly used in HD decoding because of their low complexity and moderate error correction ability. A BCH code can be represented by BCH(n, k, t) with error correction ability t; the relationship among these parameters is the same as for the Hamming code except that t ≥ 1. The complexity of encoding with t ≥ 2 is higher than for the Hamming code due to the improved error correction ability, but is relatively low compared with RS codes. The complexity of decoding varies with the decoding method, such as Berlekamp-Massey (BM) [20, 21] and Peterson [22]. BCH codes are of interest in short-haul communication systems due to the low required error correction ability of those systems.
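Since n = 2^m − 1 and n − k = m · t (when each of the minimal polynomials in the generator has degree m), the rate cost of raising t is easy to quantify. A small sketch, with m = 10 chosen purely as an illustrative example (the thesis does not fix m at this point):

```python
# Illustrative BCH parameter calculation from n = 2^m - 1 and n - k = m*t,
# for the correction abilities targeted in this work (t = 3 and 4).
# m = 10 is an assumed example value, not a thesis design choice.

def bch_params(m: int, t: int):
    n = 2**m - 1
    k = n - m * t          # holds when all 2t minimal polynomials have degree m
    return n, k, k / n     # codeword length, message length, code rate

for t in (3, 4):
    n, k, rate = bch_params(10, t)
    print(f"BCH({n}, {k}, t={t}), rate {rate:.3f}")
```

For m = 10 this gives BCH(1023, 993) at t = 3 and BCH(1023, 983) at t = 4, i.e. going from t = 3 to t = 4 costs only ten additional parity bits per block.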
2.4 Encoder
The encoder is used to encode the original message based on the code type. In this section, the BCH encoding process is illustrated. The encoding begins by finding the minimum-degree generator polynomial g(x) over GF(2). The g(x) is the least common multiple (LCM) of the minimal polynomials ϕ_i(X) of α^i (see equation 2.5), where α is a primitive element in GF(2^m) and 1 ≤ i ≤ 2t. Since any even integer can be expressed as a product of an odd integer and a power of 2, this equation can be simplified to equation 2.6 [10].

g(x) = LCM{ϕ_1(X), ϕ_2(X), ..., ϕ_{2t}(X)}, (2.5)

g(x) = LCM{ϕ_1(X), ϕ_3(X), ..., ϕ_{2t−1}(X)}. (2.6)

There are two kinds of encoding methods: one is non-systematic encoding, where the codeword is a product involving the generator polynomial and the message bits cannot be read out directly; the other is systematic encoding, where the message appears unchanged in the codeword and the redundancy is appended to it. The latter is preferred since the message can be directly read out. Systematic encoding can be implemented by [23]:

c(x) = m(x) · x^{mt} + Rem_{g(x)}[m(x) · x^{mt}], (2.7)

where Rem_{g(x)}[·] denotes the remainder after division by g(x), c(x) has the codeword coefficients (c_0, c_1, ..., c_{n−1}), m(x) has the message coefficients (m_0, m_1, ..., m_{k−1}), and g(x) has the generator coefficients (g_0, g_1, ..., g_{n−k}). From equation 2.7, the encoding is actually a remainder calculation, which can be performed using a linear feedback shift register (LFSR), albeit with low throughput and a fanout bottleneck. Parallelism can help to increase the throughput by utilizing a generator matrix G, which is derived from the generator polynomial g(x) [24]. Using the generator matrix, the encoding becomes more straightforward:

c(x) = m(x) · G. (2.8)

Since multiplication of binary bits can be simplified to the connection or disconnection of wires and addition is an XOR gate, the encoder is composed of XOR trees.

2.5 Decoder
Compared with the encoder, the decoder is more complex. The decoder detects and corrects the errors according to the error patterns.
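Before turning to the decoding methods, the remainder computation of equation 2.7 is worth a concrete look: it is plain polynomial long division over GF(2). The sketch below uses, purely for illustration, the tiny (7, 4) cyclic code with g(x) = x^3 + x + 1, not one of the longer BCH codes targeted in the thesis:

```python
# Systematic encoding (equation 2.7) as polynomial long division over GF(2).
# Assumes the (7, 4) code with generator g(x) = x^3 + x + 1 (0b1011) purely
# for illustration; the same division produces the parity bits of any
# cyclic code, including BCH codes with t = 3 and 4.

G_POLY = 0b1011   # g(x) = x^3 + x + 1
N, K = 7, 4
PARITY = N - K    # number of redundant bits

def encode(msg: int) -> int:
    """Return c(x) = m(x)*x^(n-k) + Rem_g[m(x)*x^(n-k)] as an integer bit vector."""
    assert 0 <= msg < (1 << K)
    rem = msg << PARITY                      # m(x) * x^(n-k)
    for shift in range(K - 1, -1, -1):       # long division, MSB first
        if rem & (1 << (shift + PARITY)):
            rem ^= G_POLY << shift           # subtract (XOR) g(x) * x^shift
    return (msg << PARITY) | rem             # message bits | parity remainder

cw = encode(0b0001)   # x^3 mod g(x) = x + 1, so the parity bits are 0b011
```

Because the parity bits are the division remainder, every codeword produced this way is divisible by g(x), which is exactly the property the syndrome check in section 2.5 relies on.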
There are two types of decoding methods that are commonly used: syndrome-table decoding and algebraic decoding. Both decoding methods are described below.

2.5.1 Syndrome-Table Decoding
Syndrome-table decoding comprises a syndrome calculation and a LUT addressing process (see figure 2.2). Figure 2.2: A typical syndrome table decoder

Assume a generator polynomial g(x) has [α, α^2, ..., α^{2t}] and their conjugates as roots, thus g(α^i) = 0 with 1 ≤ i ≤ 2t. Since c(x) is divisible by g(x), a root of g(x) is also a root of c(x) [10]. Therefore, a relationship can be found:

c(x) · (1, α^i, α^{2i}, ..., α^{(2^m−2)i})^T = 0. (2.9)

Since 1 ≤ i ≤ 2t, a 2t × (2^m − 1) parity-check matrix can be derived from equation 2.9:

H = | 1  α       α^2         ...  α^{2^m−2}        |
    | 1  α^2     (α^2)^2     ...  (α^2)^{2^m−2}    |
    | 1  α^3     (α^3)^2     ...  (α^3)^{2^m−2}    |
    | ...                                          |
    | 1  α^{2t}  (α^{2t})^2  ...  (α^{2t})^{2^m−2} |  (2.10)

From equation 2.9 it can be concluded that for any c(x):

c(x) · H^T = 0. (2.11)

The transmission media may add interference e(x) to the sent message, and the received message r(x) becomes:

r(x) = c(x) + e(x), (2.12)

where e(x) = e_0 + e_1·x + ... + e_{2^m−2}·x^{2^m−2}. Because of the interference, the relationship of equation 2.11 is disrupted, and the multiplication of r(x) and H^T yields the syndrome S:

S = (S_1, S_2, ..., S_{2t}) = r(x) · H^T, (2.13)

where S_i = r(α^i) for 1 ≤ i ≤ 2t. From equations 2.12 and 2.11, the syndrome calculation can be rewritten as

r(x) · H^T = (c(x) + e(x)) · H^T = e(x) · H^T. (2.14)

For the error correction ability t = 1, it is efficient to use syndrome-table decoding since there are not that many error patterns. The syndrome S can be directly mapped to the error patterns through a LUT [25]. However, this method is inefficient for t > 1, given that the minimum-distance decoding of BCH codes requires 2^{n−k} − 1 syndrome-table entries.
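The syndrome computation of equation 2.13 is inexpensive either way. The following sketch evaluates S_i = r(α^i) for the small BCH(15, 7, t = 2) code over GF(2^4), an assumed example (the thesis targets t = 3 and 4, but the construction is identical):

```python
# Sketch of the syndrome calculation S_i = r(alpha^i) (equation 2.13), using
# BCH(15, 7, t = 2) over GF(2^4) with p(x) = x^4 + x + 1 as an illustration.

M, N = 4, 15
PRIM = 0b10011                      # primitive polynomial x^4 + x + 1

def gf_mul(a, b):
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM
    return prod

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def syndrome(r_bits, i):
    """Evaluate the received polynomial at alpha^i: S_i = r(alpha^i)."""
    x = gf_pow(0b0010, i)           # alpha^i, with alpha = x
    s, power = 0, 1
    for bit in r_bits:              # r_bits[j] is the coefficient of x^j
        if bit:
            s ^= power
        power = gf_mul(power, x)
    return s

# g(x) = x^8 + x^7 + x^6 + x^4 + 1 generates BCH(15, 7, 2); any multiple of
# g(x) is a codeword, so its syndromes S_1..S_4 all vanish.
g = 0b111010001
cw = [(g >> j) & 1 for j in range(N)]           # codeword c(x) = g(x)
assert all(syndrome(cw, i) == 0 for i in range(1, 5))

cw[2] ^= 1                                       # inject an error at x^2
print([syndrome(cw, i) for i in range(1, 5)])    # S_i = alpha^(2i): [4, 3, 12, 5]
```

The run also illustrates equation 2.14: since the codeword contributes zero, the syndromes depend only on the injected error, and the binary-BCH property S_2 = S_1^2 used later by the Peterson algorithm holds.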
2.5.2 Algebraic decoding
Compared with syndrome-table decoding, algebraic decoding is more efficient when the error correction ability is more than one. Algebraic decoding also begins with a syndrome calculation to get the syndrome S. The next step is to use the syndrome to determine the error-location polynomial Λ(x) from the Newton identities. This process is also called key-equation solving (see figure 2.3). Figure 2.3: A typical algebraic decoder, KES is the key equation solver

There are different types of algebraic algorithms to solve the key equation; the commonly used ones are the Berlekamp-Massey (BM) algorithm [20, 21] and Peterson's algorithm [22], which are illustrated below. After the error location polynomial is found, the Chien search [26] is applied to determine the roots and the error pattern. To retrieve the original message, the error pattern is added to the received code. According to equations 2.11 and 2.12, when there are errors added by the transmission media, equation 2.11 is not satisfied. Because c(α^i) = 0, equation 2.13 can be simplified to:

S_i = e(α^i). (2.15)

Let us assume the errors happen at the locations j_1, j_2, ..., j_z, where 0 ≤ j_1 < j_2 < ... < j_z < n. Then the error polynomial becomes:

e(x) = x^{j_1} + x^{j_2} + ... + x^{j_z}. (2.16)

And the syndromes become:

S_1 = e(α) = α^{j_1} + α^{j_2} + ... + α^{j_z}
S_2 = e(α^2) = (α^{j_1})^2 + (α^{j_2})^2 + ... + (α^{j_z})^2
...
S_{2t} = e(α^{2t}) = (α^{j_1})^{2t} + (α^{j_2})^{2t} + ... + (α^{j_z})^{2t}. (2.17)

Thus, with this relationship between the syndromes and the error locations, the problem becomes how to determine the error locations using the syndromes. Let us simplify the above set of equations by defining:

β_l = α^{j_l}. (2.18)

So equation 2.17 can be rewritten as:

S_1 = β_1 + β_2 + ... + β_z
S_2 = β_1^2 + β_2^2 + ... + β_z^2
...
S_{2t} = β_1^{2t} + β_2^{2t} + ... + β_z^{2t}. (2.19)

The above set of equations is nonlinear, thus solving it directly would be very difficult.
But we can solve it in an indirect way, by first assuming the error-location polynomial

Λ(x) = (1 + β1x)(1 + β2x) · · · (1 + βzx) (2.20)
     = Λ0 + Λ1x + · · · + Λz·x^z, (2.21)

which has β1^−1, β2^−1, ..., βz^−1 as roots, where

Λ1 = β1 + β2 + · · · + βz,
Λ2 = β1β2 + β1β3 + · · · + βz−1βz,
Λ3 = β1β2β3 + β1β2β4 + · · · + βz−2βz−1βz,
⋮
Λz = β1β2 · · · βz. (2.22)

The problem then becomes determining the coefficients of the error-location polynomial Λ(x); once the coefficients are known, the roots can be found. There are different methods to find the coefficients of Λ(x), such as the BM and Peterson algorithms, but the underlying theory is the same. The coefficients can be determined by solving the Newton identities:

S1 + Λ1 = 0
S2 + Λ1S1 + 2Λ2 = 0
S3 + Λ1S2 + Λ2S1 + 3Λ3 = 0
⋮
Sz + Λ1Sz−1 + Λ2Sz−2 + · · · + Λz−1S1 + zΛz = 0. (2.23)

If we can solve the Newton identities, we obtain the coefficients of the error-location polynomial and, in the end, the roots. In general, more than one error pattern may satisfy the Newton identities [10], so the most probable one should be found. For the binary symmetric channel (BSC), the problem becomes finding the coefficients that satisfy the Newton identities with a polynomial of minimum degree.

2.5.2.1 Berlekamp-Massey Algorithm

The Berlekamp-Massey (BM) algorithm is essentially a process that iteratively finds the minimum-degree polynomial satisfying the Newton identities. It takes 2t iterations to verify and adjust the minimum-degree polynomial according to the value of the discrepancy dk. The calculation of dk verifies the first k + 1 Newton identities 2.23:

dk = Sk+1 + Λ1^(k)·Sk + Λ2^(k)·Sk−1 + · · · + Λlk^(k)·Sk+1−lk, (2.24)

where lk is the degree of the current error-location polynomial and Λ^(k)(x) is the current minimum-degree polynomial satisfying the first k Newton identities:

Λ^(k)(x) = Λ0^(k) + Λ1^(k)·x + · · · + Λlk^(k)·x^lk. (2.25)

If dk = 0, the current error-location polynomial also satisfies the (k+1)-th identity and no adjustment is needed. If the discrepancy is nonzero, the current Λ^(k)(x) cannot satisfy the (k+1)-th identity; a correction term is added to Λ^(k)(x), and Λ^(k+1)(x) becomes

Λ^(k+1)(x) = Λ^(k)(x) + dk·di^−1·x^(k−i)·Λ^(i)(x), (2.26)

where i is a step prior to the k-th step with di ≠ 0 for which i − li has the largest value, and di and Λ^(i)(x) are the discrepancy and minimum error-location polynomial of the i-th step. lk increases by one when the correction is applied to Λ^(k)(x). The final minimum error-location polynomial is Λ^(2t)(x), obtained at the 2t-th iteration. The BM algorithm is summarized in algorithm 1.

Algorithm 1: The original Berlekamp-Massey algorithm for BCH codes
  input: Si (1 ≤ i ≤ 2t)
  initialization: k = −1: Λ^(−1)(x) = 1, d−1 = 1, l−1 = 0, k − l−1 = −1
                  k = 0:  Λ^(0)(x) = 1, d0 = 1, l0 = 0, k − l0 = 0
  begin:
  for k = 0, 1, 2, ..., 2t − 1
      dk = Sk+1 + Λ1^(k)·Sk + Λ2^(k)·Sk−1 + · · · + Λlk^(k)·Sk+1−lk
      if dk = 0 then
          Λ^(k+1)(x) = Λ^(k)(x)
          lk+1 = lk
      else
          Λ^(k+1)(x) = Λ^(k)(x) + dk·di^−1·x^(k−i)·Λ^(i)(x)
          lk+1 = lk + 1
      end if
  output: Λ(x) = Λ^(2t)(x) when k = 2t

The most hardware- and time-consuming part of the BM algorithm is the calculation of the discrepancy, and the inversion of di is hardware-unfriendly, especially when m is large. Determining the final Λ(x) takes 2t cycles, leading to high latency. Many improved algorithms have been proposed to decrease the latency and hardware overhead: for example, the simplified BM decreases the iteration count to t [10] by skipping the even iterations, whose discrepancies are always zero, and the inversionless BM (iBM) removes the inversion of di, leading to a hardware-efficient design [27].

2.5.2.2 Peterson Algorithm

The Peterson algorithm [28] can remove the hardware-expensive inversion and reduce the latency for binary BCH codes. From equation 2.23, the coefficients of the error-location polynomial and the syndromes satisfy the Newton identities. For binary BCH codes, S2j = Sj^2, where 1 ≤ j ≤ t, and

aΛi = 0   if a is an even integer,
aΛi = Λi  if a is an odd integer, (2.27)

where 0 ≤ i ≤ t. An even row of the Newton identities can therefore be skipped if the preceding odd rows are satisfied. For example, if the first row S1 + Λ1 = 0 is satisfied, then the second row becomes

S2 + Λ1S1 + 2Λ2 = S1^2 + S1S1 = 0. (2.28)

Therefore, the Newton identities can be rewritten and shortened to

        ⎡ 1       0       0       · · ·  0      0    ⎤ ⎡ Λ1   ⎤   ⎡ S1    ⎤
        ⎢ S2      S1      1       · · ·  0      0    ⎥ ⎢ Λ2   ⎥   ⎢ S3    ⎥
A′Λ =   ⎢ S4      S3      S2      · · ·  0      0    ⎥ ⎢ Λ3   ⎥ = ⎢ S5    ⎥ , (2.29)
        ⎢ ⋮       ⋮       ⋮       ⋱      ⋮      ⋮    ⎥ ⎢ ⋮    ⎥   ⎢ ⋮     ⎥
        ⎢ S2t−4   S2t−5   S2t−6   · · ·  St−2   St−3 ⎥ ⎢ Λt−1 ⎥   ⎢ S2t−3 ⎥
        ⎣ S2t−2   S2t−3   S2t−4   · · ·  St     St−1 ⎦ ⎣ Λt   ⎦   ⎣ S2t−1 ⎦

where A′ is the syndrome matrix. If A′ is nonsingular, the set of equations in 2.29 has a unique solution. According to this equation, the error-location polynomial can be calculated directly from the syndromes without iteration, leading to low overhead. For small error-correction abilities, t = 1, 2, the set of equations is quite simple. For t ≥ 3, however, the complexity increases significantly with t, resulting in higher power consumption and a longer critical path.

2.5.3 Chien Search

From the error-location polynomial Λ(x), the roots β1^−1, β2^−1, ..., βv^−1 can be determined by the Chien search [26]. This procedure is based on the cyclic property of cyclic codes and substitutes the elements α, α^2, ..., α^n one by one to verify whether they satisfy Λ(x). Assume the received error-location polynomial Λ(x) is

Λ(x) = Λ0 + Λ1x + · · · + Λt·x^t. (2.30)

It has been proven in [26] that if, after the τk-th shift, equation 2.30 satisfies the relationship

Λ(α^τk) = Λ0 + Λ1·α^τk + · · · + Λt·(α^τk)^t = 0, (2.31)

then one error location is α^(n−τk), the inverse of the root. After n successive transformations, up to t roots will be found. The transformation process can be implemented by GF multiplication and addition according to formula 2.31. There are two approaches to implementing this formula in hardware: one prioritizes hardware efficiency at the cost of a longer delay, while the other emphasizes timing efficiency at the cost of larger area usage. For the hardware-efficient one, only t GF multipliers and t + 1 GF adders are needed. The successive transformation is performed by iteratively inputting each element of the GF field in order and evaluating whether the resulting output satisfies formula 2.31; if it does, the iteration index is stored. After n iterations, the t roots are identified.

Unrolling can be employed to accelerate the Chien search. The process starts by constructing an element array E,

     ⎡ α        α^2        · · ·  α^t        ⎤
     ⎢ (α)^2    (α^2)^2    · · ·  (α^t)^2    ⎥
E =  ⎢ (α)^3    (α^2)^3    · · ·  (α^t)^3    ⎥ , (2.32)
     ⎢ ⋮        ⋮          ⋱      ⋮          ⎥
     ⎣ (α)^n    (α^2)^n    · · ·  (α^t)^n    ⎦

which encompasses all the elements required for verification. The verification of 2.31 then becomes

Λ0 + [Λ1, Λ2, . . . , Λt] · E = 0, (2.33)

and the index of each root is stored for the final error correction. By unrolling the iteration, the Chien search finishes in one clock cycle, but at the cost of nt multipliers and n(t + 1) adders. The final error correction is accomplished by adding the error-pattern coefficients, determined by the error locations, to the received coefficients.

3 Methods

The development of this design commences with the implementation of a bit-operation FEC simulator using MATLAB, which is then adjusted with the help of an estimation model.
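The iterative, hardware-efficient Chien search of section 2.5.3 can be modelled in a few lines of Python. This is a sketch under illustrative assumptions: the field GF(2^4) with p(x) = x^4 + x + 1 and the two error positions are chosen for the example only, and the function names are hypothetical. Each register q[j] holds Λj·(α^j)^k after k cycles, mirroring the constant-multiplier feedback structure of the hardware.

```python
# Sketch of the iterative Chien search over GF(2^4), p(x) = x^4 + x + 1.
M, N = 4, 15
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
x = 1
for i in range(N):                # build exp/log tables for GF(2^4)
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0b10000:
        x ^= 0b10011              # reduce modulo p(x) = x^4 + x + 1
for i in range(N, 2 * N):
    EXP[i] = EXP[i - N]

def gmul(a, b):                   # GF multiplication via log tables
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def chien_search(lam, n=N):
    """Register q[j] holds Λj·(α^j)^k; a zero sum at cycle k marks the
    root α^k and hence an error at position n − k (equation 2.31)."""
    q = lam[:]                    # cycle 0: q[j] = Λj
    locations = []
    for k in range(1, n + 1):
        for j in range(1, len(q)):
            q[j] = gmul(q[j], EXP[j])   # multiply by the constant α^j
        s = 0
        for c in q:
            s ^= c                # GF(2^m) addition is bitwise XOR
        if s == 0:
            locations.append((n - k) % n)
    return sorted(locations)

# Two errors at positions 3 and 7: Λ(x) = (1 + α^3·x)(1 + α^7·x).
X1, X2 = EXP[3], EXP[7]
lam = [1, X1 ^ X2, gmul(X1, X2)]
print(chien_search(lam))          # -> [3, 7]
```

In hardware, the inner update corresponds to t constant-coefficient GF multipliers feeding an adder tree, matching the t-multiplier, (t + 1)-adder count quoted above.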
The simulator provides the features needed by the VHDL generator, and the VHDL code is generated according to those features. The VHDL code is passed to an auto-synthesis system, which synthesizes it automatically and collects the results. In the end, the synthesized register-transfer level (RTL) netlists are delivered to the binary symmetric channel (BSC) simulator, and the results of the BSC simulator and the evaluation system are plotted in the plot generator. The overall development flow is shown in figure 3.1.

Figure 3.1: The overall thesis development flow

3.1 Bit-operation FEC Simulator

An FEC system comprising an encoder and a decoder is established in MATLAB to validate the comprehension underlying those concepts. However, in order to gain a deeper insight into the hardware perspective of GF arithmetic, all the components are implemented as bit-true operations, in other words, using logic-gate operations. This approach avoids relying on the GF functions embedded in MATLAB. First, the model for each component is implemented in a bit-true operation style; the components are then combined to form a cohesive FEC system. To validate functionality and accuracy, the results of the bit-true operation model are compared with an alternative model that uses the embedded MATLAB functions. To support the VHDL code generator, the bit-operation FEC simulator also generates the required feature arrays, such as the generator matrix G and the parity-check matrix H. The FEC simulator can be adjusted depending on the results of the estimation model.

3.2 Estimation Model

To compare the performance of different key equation solver algorithms and system architectures, an estimation model is implemented in MATLAB. The estimation model can not only predict the trends of timing and area usage as the error-correction ability t and the field order m vary, but also evaluate the timing and hardware complexity based on gate counts, using formulas derived from previous papers.
Because precise performance figures for the different logic gates are hard to obtain during the early design phase, the estimation model can only give an estimate based on gate counts. The estimation results serve as a basis for improving the FEC and making informed decisions. By leveraging the insights gained from the estimation model, the FEC can be refined toward low hardware and timing complexity, and a final decision can be reached regarding these goals.

3.3 VHDL Code Generator

Manually writing all the RTL is tedious and cumbersome. Thanks to the regularity of the key equation solver algorithm and of the matrix-based syndrome and Chien search components, the code can be produced by a VHDL code generator written in MATLAB. The generator produces the VHDL code based on the code length, the error-correction ability, and the system architecture. Specifically, it connects and disconnects the input pins based on the generator matrix and the parity-check matrix, and arranges the GF multipliers and adders according to the system architecture. To facilitate the functional test of individual components and of the whole system, testbench files are also generated by the code generator, using the test vectors produced by the bit-operation FEC simulator.

3.4 Auto-synthesis System

All the VHDL code is synthesized to an RTL netlist using a fully-depleted silicon-on-insulator (FDSOI) 22-nm CMOS design kit characterized at a typical 0.9-V corner. The VHDL code is first compiled and evaluated for pre-synthesis functionality in Cadence Incisive. The code is then sent to the synthesis engine (Cadence Genus) to generate the RTL design according to the design library. When synthesis finishes, the engine also produces estimation reports of power consumption and area usage. Simulation is performed again to ensure the functionality of the post-synthesis netlist. To increase efficiency, a Bash-based auto-synthesis system is used.
This script automates the evaluation process, saving time and labor by automatically running pre-synthesis simulation, synthesis, and post-synthesis simulation, and by collecting data from the reports.

3.5 BSC Simulator

A BSC simulator is proposed to evaluate the performance of the post-synthesis RTL netlists. The BSC simulator generates random bit errors according to a specified error rate. The errors are added to the encoded codewords and sent to the decoder; the decoded codewords are compared with their encoded counterparts, and uncorrected errors are counted to calculate the output BER. The power consumption is also estimated based on the value change dump (VCD) file generated by the pre-synthesis simulation.

4 Design

In this chapter, a power evaluation system is implemented to estimate the power reduction offered by FECs, and the final design is solidified with the help of the estimation model.

4.1 Power Evaluation System

In order to handle high-temperature conditions, FECs are integrated to decrease the output BER. To avoid exacerbating the high-temperature environment, the additional power dissipated by the FEC should be lower than the total system power reduction it offers. To demonstrate this, a realistic fiber-optic communication evaluation system needs to be built. The evaluation system should mimic the performance of the transmitter, such as the driver and VCSEL power, the influence of the transmission medium, and the performance of the receiver. An evaluation system is proposed in this design based on the transmitter power-dissipation numbers from [29, 30, 31, 32] and the VCSEL performance from [33, 34]. The effect of the receiver is neglected because its impact is minor compared to the transmitter's. Elevated temperatures can induce additional noise because of thermal fluctuations. The integration of FECs helps to protect the channel, enabling the system to operate reliably under higher-BER conditions.
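The error-injection core of the BSC simulator (section 3.5) amounts to flipping each transmitted bit with the target probability. The following is a minimal Python sketch with hypothetical names and an arbitrary seed; the thesis's actual simulator is written in MATLAB and drives the synthesized netlist.

```python
import random

def bsc(bits, p, rng):
    """Binary symmetric channel: flip each bit independently with probability p."""
    return [b ^ (rng.random() < p) for b in bits]

rng = random.Random(1)            # fixed seed for a repeatable experiment
n_bits = 100_000
tx = [0] * n_bits                 # stand-in for an encoded codeword stream
rx = bsc(tx, 1e-2, rng)
flips = sum(b1 != b2 for b1, b2 in zip(tx, rx))
print(flips / n_bits)             # measured input BER, close to 1e-2
```

In the actual flow, the flipped stream is fed to the decoder netlist and the residual uncorrected errors are counted to obtain the output BER.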
Suitable FECs can also offer power relaxation to the whole system. By detecting and correcting the errors caused by the medium, the FEC decreases the BER of the received signal and hence the OMA requirement at the receiver. The first step is therefore to find the OMA reduction obtained when integrating FECs. An example is shown in figure 4.1: for a target BER of 10^−12, a FEC with only t = 3 results in around 3 dB of OMA relaxation at a 55-Gbps throughput. The 3-dB relaxation at the receiver leads to the same OMA reduction at the transmitter. Figure 4.2 shows the relationship between throughput and OMA reduction: when the system operates at higher throughput, the FEC offers more power reduction, from below 2.5 dB at 20 Gbps to above 3.5 dB at 57 Gbps. The OMA reduction finally relaxes the electrical power requirement at the transmitter, leading to a more power-efficient operating point. Figures 4.3 and 4.4 show the power reduction of the t = 3, 4 BCH FECs at a 55-Gbps data rate. For the ultra-low-power (20-mW) transmitter system, the BCH FECs can offer around 10 mW of power reduction, while for the 190-mW transmitter system, they can provide up to 113 mW of power relaxation. If the additional power added by the FECs is less than these power reductions, the harsh environment will not be exacerbated.

Figure 4.1: Post-FEC BER as a function of the receiver OMA at 55 Gbps
Figure 4.2: Receiver OMA reduction as a function of throughput
Figure 4.3: Power reduction as a function of m for t = 3 at 55 Gbps
Figure 4.4: Power reduction as a function of m for t = 4 at 55 Gbps

4.2 Galois Field Multiplier

The Galois field multiplier is a critical component of the decoder because both the key equation solver and the Chien search need it. There are different kinds of Galois field multipliers, such as Mastrovito multipliers [13], Karatsuba multipliers [35], and Chinese remainder theorem (CRT) multipliers [36]. Space and time complexity are generally used to evaluate the performance of GF multipliers: the space complexity is commonly represented by the total number of XOR and AND gates, and the time complexity is evaluated by the critical path, given the delay TX of an XOR gate and TA of an AND gate. Of the above multipliers, Mastrovito and Karatsuba are the most commonly used [37]. Mastrovito multipliers are common in implementations with small m, where the space complexity may be noncritical and a quadratic multiplier such as Mastrovito yields a high-speed design [37].
For larger values of m, however, the Karatsuba counterparts are more attractive with regard to space complexity. In order to select a suitable multiplier, different implementations based on Karatsuba [35, 38, 39] and Mastrovito [40, 41] are analyzed, and a modular-reduction multiplier [42] is also evaluated since a previous design [8] utilizes this method. Because the space complexity is not critical for low m, the analysis is performed in terms of time complexity. The comparison of the mentioned multipliers is shown in figure 4.5.

Figure 4.5: The timing comparison of different Galois field multipliers over order m
Figure 4.6: Gate count of different Galois field multipliers over order m

The timing comparison in figure 4.5 only considers the delay of the XOR gates, since the delay caused by the AND gates is constant across the different methods. From figure 4.5, it can be concluded that the Mastrovito multiplier in [41] is the fastest because of the elimination of useless XOR gates and the optimization of the XOR trees. Therefore, we implemented the Mastrovito multiplier from [41]. Figure 4.6 evaluates the hardware complexity of the different methods by counting the AND and XOR gates. The Karatsuba multipliers from [39] offer the lowest hardware complexity but pay a timing penalty. The Mastrovito multipliers in [41] also provide comparably low hardware complexity.
It should be noted that for m = 8 no pentanomial-based implementations are reported for the multipliers in [35, 38, 39, 40, 42], so they are not included in figures 4.5 and 4.6 at that point.

The Mastrovito multiplier performs matrix multiplication. It recasts the multiplication C = A · B as c = M · b, where c and b are the binary coefficient vectors of C and B, and M is a multiplication matrix generated from the coefficient vector a of A. The Mastrovito multiplication can then be divided into two steps. The first step is to compute the matrix M. Assume an irreducible polynomial p(x) of GF(2^m),

p(x) = x^m + pm−1·x^(m−1) + · · · + p1·x + 1, (4.1)

with root α, and define the basis row vector r = [1, α, α^2, ..., α^(m−1)]. Given the polynomial representations of the elements A and B,

A = a0 + a1α + · · · + am−1·α^(m−1) = r · a^T, (4.2)
B = b0 + b1α + · · · + bm−1·α^(m−1) = r · b^T, (4.3)

the calculation of C can be rewritten as

C = b0·A + b1·A·α + · · · + bm−1·A·α^(m−1) = [A, A·α, . . . , A·α^(m−1)] · b^T, (4.4)

where b = [b0, b1, ..., bm−1]. Introducing Mi = [M0,i, M1,i, . . . , Mm−1,i] and reformulating equation 4.4 to

C = r · [M0^T, M1^T, . . . , Mm−1^T] · b^T (4.5)

gives the relationship

A · α^i = r · Mi^T. (4.6)

From equation 4.6, the first column of matrix M is equal to a, and the other columns can be computed from this first column:

r · Mi^T = r · α · Mi−1^T. (4.7)

The computation of the following columns of matrix M then becomes [41]:

M0,i = Mm−1,i−1,                  i = 1, . . . , m − 1,
Mj,i = Mj−1,i−1 + Mm−1,i−1 · pj,  1 ≤ i, j ≤ m − 1. (4.8)

Since pj is constant for each design, the calculation of matrix M reduces to wire connections and XOR operations. The second step is comparably simple:

cj = Mj,0·b0 + Mj,1·b1 + · · · + Mj,m−1·bm−1, (4.9)

which can be implemented with m AND gates and m − 1 XOR gates for the binary multiplications and additions. A Mastrovito multiplier for GF(2^6) with p(x) = x^6 + x^3 + 1 is shown in figure 4.7.
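The two-step Mastrovito procedure (equations 4.8 and 4.9) can be checked in software. The sketch below, with hypothetical function names, builds M column by column for GF(2^6) with p(x) = x^6 + x^3 + 1 and compares the result against a plain shift-and-reduce reference multiplier:

```python
M_DEG = 6
P_BITS = [1, 0, 0, 1, 0, 0]            # p_j for j = 0..5 of p(x) = x^6 + x^3 + 1
P_INT = (1 << 6) | (1 << 3) | 1        # p(x) as a bit mask

def gf_mul_ref(a, b):
    """Reference: carry-less multiply, then reduce modulo p(x)."""
    r = 0
    for i in range(M_DEG):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * M_DEG - 2, M_DEG - 1, -1):
        if (r >> i) & 1:
            r ^= P_INT << (i - M_DEG)
    return r

def mastrovito_matrix(a):
    """Column 0 is a; later columns follow equation 4.8 (wires plus XORs)."""
    M = [[(a >> j) & 1 if i == 0 else 0 for i in range(M_DEG)]
         for j in range(M_DEG)]
    for i in range(1, M_DEG):
        M[0][i] = M[M_DEG - 1][i - 1]
        for j in range(1, M_DEG):
            M[j][i] = M[j - 1][i - 1] ^ (M[M_DEG - 1][i - 1] & P_BITS[j])
    return M

def gf_mul_mastrovito(a, b):
    """Second step (equation 4.9): c_j is an AND/XOR row product M_j · b."""
    M = mastrovito_matrix(a)
    c = 0
    for j in range(M_DEG):
        bit = 0
        for i in range(M_DEG):
            bit ^= M[j][i] & ((b >> i) & 1)
        c |= bit << j
    return c

# Exhaustive check over all 64 x 64 operand pairs of GF(2^6).
assert all(gf_mul_mastrovito(a, b) == gf_mul_ref(a, b)
           for a in range(64) for b in range(64))
print("Mastrovito matches the reference multiplier")
```

Because pj is fixed per design, the inner XOR/AND pattern of `mastrovito_matrix` is exactly the wiring that the hardware block hard-codes.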
Figure 4.7: The Mastrovito multiplier structure for GF(2^6) with p(x) = x^6 + x^3 + 1

The first block calculates the matrix M and is composed of XOR gates and wire connections. An XOR gate is added depending on the value of pj: when pj = 1, an XOR gate is employed to calculate the element of the next column; otherwise, only a wire connection is needed. Hardware sharing can be exploited by tracing the representation of the current result back to the first-column elements: M3,4 in the above example can be expressed as M2,0 XOR M5,0 XOR M5,0, which simplifies to M2,0, since XORing a value with itself yields zero and XORing any value with zero leaves it unchanged. The second block comprises m AND gates and m − 1 XOR gates and computes c = M · b. Figure 4.7 shows only one computation unit for cj; the whole block contains m such units.

4.3 Berlekamp-Massey Decoder

In this design, we implement a decoder based on a look-ahead reformulated inversionless BM (riBM) algorithm with t = 3 or 4.

4.3.1 Syndrome Calculation

The hardware structure of the syndrome calculation is quite simple. From equation 2.13 and matrix 2.10, the syndrome calculation is composed of bit additions and multiplications. As mentioned before, a binary bit addition can be implemented by an XOR gate and a multiplication by an AND gate; if one input of the AND gate is constant, the gate can be replaced by a wire connection or disconnection. Therefore, the syndrome calculation can be simplified to XOR trees generated from the parity-check matrix H. In matrix 2.10, each GF element can be rewritten as an m-bit binary vector. Hence the hardware complexity of the syndrome calculation can be represented by 2tm XOR trees, each containing at most n − 1 XOR gates and on average n/2 − 1 XOR gates, assuming half the entries of each row are ones.
The maximum and average total XOR-gate counts then become 2tm(n − 1) and 2tm(n/2 − 1), respectively, assuming symmetric trees for low logic depth (see figure 4.8).

Figure 4.8: One syndrome-bit calculation structure assuming the worst case of n inputs

From figure 4.8, the worst-case complexity of the syndrome calculation is

TSD = log2(n) · Txor, (4.10)
ASD = 2tm(n − 1) · Axor, (4.11)

and the average complexity, when H is filled with half ones, is

TSD = log2(n/2) · Txor, (4.12)
ASD = 2tm(n/2 − 1) · Axor. (4.13)

4.3.2 Key Equation Solver

Since Berlekamp and Massey introduced the primitive BM algorithm, numerous enhanced algorithms have been built on its foundation in efforts to decrease the hardware complexity and improve the timing performance. A literature review is presented below regarding hardware complexity and timing complexity at the component level. Specifically, the focus is on evaluating the effects of the GF multipliers, GF adders, and multiplexers (MUXes); for the hardware complexity (area usage), the effect of the registers is also included. The timing and hardware complexities of the GF multipliers are taken from section 4.2 and vary with m. For a GF adder with two m-bit inputs, the critical path equals one XOR gate and the area usage is m XOR gates. A two-input m-bit MUX has the same delay as a two-input 1-bit MUX and the area of m such 1-bit MUXes. The registers are modeled in the same way as the MUXes. The complexities of the different components are summarized in tables 4.1 and 4.2, where A and T represent the area and delay of a basic logic gate; i.e., Axor is the area of one XOR gate and Txor is the delay of one XOR gate.
Table 4.1: The hardware complexity of each component

Multiplier, trinomial, n = 1:        (m^2 − 1) · Axor + m^2 · Aand
Multiplier, trinomial, 2 ≤ n ≤ m/2:  (m^2 − 1) · Axor + m^2 · Aand
Multiplier, pentanomial:             (m(m − 1) + 3(m − 1)) · Axor + m^2 · Aand
Adder:                               m · Axor
MUX:                                 m · Amux
REG:                                 m · Areg

Table 4.2: The timing complexity of each component

Multiplier, trinomial, n = 1:        ⌈log2(2m − 1)⌉ · Txor + Tand
Multiplier, trinomial, 2 ≤ n ≤ m/2:  ⌈log2(2m + 2n − 3)⌉ · Txor + Tand
Multiplier, pentanomial:             ⌈log2(4m + 12n − 21)⌉ · Txor + Tand
Adder:                               Txor
MUX:                                 Tmux
REG:                                 Treg

The primitive BM algorithm [20, 21] has a long critical path that depends on the error-correction ability. As can be seen from algorithm 1, the delay of computing dk is Tmult + ⌈log2(lk + 1)⌉ · Tadd, where Tmult and Tadd are the delays of a two-input m-bit GF multiplier and adder, respectively. Considering the worst case, when the discrepancy is nonzero, the adjustment of Λ^(k+1)(x) adds t GF multiplications and t − 1 additions to the circuit and increases the maximum delay to

TBM = 2 · Tmult + (⌈log2(lk + 1)⌉ + ⌈log2(t)⌉) · Tadd + TMUX, (4.14)

where TMUX is the delay of a two-input m-bit MUX. This delay only covers the influence of the GF multipliers and adders, neglecting other circuitry such as the LUT used for inversion; including these effects, the critical path becomes even longer. Regarding hardware complexity, the calculation of dk needs lk GF multipliers and lk GF adders, amounting to lk · (Amult + Aadd), where Amult and Aadd are the areas of one GF multiplier and one adder. The computation of Λ^(k+1)(x) requires t + 1 GF multipliers, t + 1 adders, t MUXes, and t + 1 registers, leading to a total hardware complexity of

ABM = (lk + t + 1) · Amult + (lk + t + 1) · Aadd + t · AMUX + (t + 1) · AREG. (4.15)

For binary BCH codes, the even iterations can be eliminated because the resulting discrepancies are always zero [10].
Hence the iteration count decreases to t, while the timing and hardware complexities remain essentially the same:

TBM = 2 · Tmult + (⌈log2(lk + 1)⌉ + ⌈log2(t)⌉) · Tadd, (4.16)
ABM = (lk + t + 1) · Amult + (lk + t + 1) · Aadd + t · AMUX + (t + 1) · AREG. (4.17)

In order to replace the hardware- and timing-consuming GF inversion of the primitive BM algorithm, inversionless BM (iBM) algorithms were proposed and implemented in [43, 44]. In those designs, the arduous division, or the LUT used to implement the inversion, is replaced by multiplication, leading to a fast circuit. Based on an implementation from [27], the complexities of iBM become

TiBM = 2 · Tmult + (1 + ⌈log2(t + 1)⌉) · Tadd, (4.18)
AiBM = (5t + 3) · Amult + (3t + 1) · Aadd + (2t + 1) · AMUX + (4t + 2) · AREG. (4.19)

Evidently, the increased hardware complexity pays for the decrease in timing complexity. To further reduce the critical path, reformulated inversionless algorithms (riBM) were introduced and implemented in [27]. These algorithms compute the next discrepancy ∆^(k+1)(x) and the next error-location polynomial Λ^(k+1)(x) at the same time, reducing the longest delay to

TriBM = Tmult + Tadd, (4.20)

at the cost of only t − 1 additional multipliers:

AriBM = (6t + 2) · Amult + (3t + 1) · Aadd + (3t + 1) · AMUX + (6t + 2) · AREG. (4.21)

The RiBM algorithm is just a rearranged version of riBM, aiming to make the processing engines identical; therefore, the complexity of RiBM is the same as that of riBM. Subsequently, the truncated iBM (TiBM) [45] and the compensated simplified RiBM (CS-RiBM) [46] were proposed to replace the unnecessary computations of the evaluation polynomial in RiBM with simple compensation, targeting the RS(255,239) code with 16 iterations. For this design, which uses binary BCH codes, the iteration count decreases from 2t to t and the target error-correction abilities are t = 3, 4. The iteration count is therefore not that large, so the TiBM and CS-RiBM algorithms are not considered.
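The kind of gate-count bookkeeping performed by the estimation model (section 3.2) can be sketched in a few lines. The area formulas come from table 4.1 and the component counts from equations 4.19 (iBM) and 4.21 (riBM); the Python packaging and names are illustrative, and one m-bit MUX or register is counted as m unit gates, as in the gate-count figures:

```python
# Area of one GF(2^m) multiplier as (XOR gates, AND gates), from table 4.1.
def mult_area(m, poly="trinomial"):
    if poly == "trinomial":
        return m * m - 1, m * m
    return m * (m - 1) + 3 * (m - 1), m * m          # pentanomial

# Component counts (multipliers, adders, MUXes, registers) per key
# equation solver, from equations 4.19 (iBM) and 4.21 (riBM).
KES_COUNTS = {
    "iBM":  lambda t: (5 * t + 3, 3 * t + 1, 2 * t + 1, 4 * t + 2),
    "riBM": lambda t: (6 * t + 2, 3 * t + 1, 3 * t + 1, 6 * t + 2),
}

def kes_gate_count(alg, t, m, poly="trinomial"):
    """Total gate count (#XOR + #AND + #MUX + #REG) of one solver."""
    n_mult, n_add, n_mux, n_reg = KES_COUNTS[alg](t)
    a_xor, a_and = mult_area(m, poly)
    return n_mult * (a_xor + a_and) + (n_add + n_mux + n_reg) * m

print(kes_gate_count("riBM", 3, 6))   # -> 1660
print(kes_gate_count("iBM", 3, 6))    # -> 1464
```

Sweeping t and m with such functions reproduces the qualitative trend of figures 4.10 and 4.12: riBM trades a larger gate count for its shorter Tmult + Tadd critical path.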
In order to find a suitable algorithm, the hardware and timing complexities of the BM, iBM, and riBM algorithms for binary BCH codes with t = 3 and 4 are compared in figures 4.9 to 4.12. The performance of CS-RiBM and TiBM is also illustrated for comparison.

Figure 4.9: Number of logic gates in the critical path for t = 3
Figure 4.10: Gate count, including the XOR, AND, MUX, and REG gates, for t = 3
Figure 4.11: Number of logic gates in the critical path for t = 4
Figure 4.12: Gate count, including the XOR, AND, MUX, and REG gates, for t = 4

The TiBM and CS-RiBM curves overlap in figures 4.9 and 4.11 since their timing is identical. From figures 4.9 to 4.12, riBM exhibits superior timing performance across different m, albeit at the expense of increased area usage. The performance of the key equation solver varies with the structure of the multiplier and has the longest path at m = 8, since no irreducible trinomial exists for m = 8 and a pentanomial must be used. Consequently, it is advisable to avoid codes with m = 8 from a hardware perspective.

4.3.3 riBM Structure

This design utilizes the look-ahead reformulated inversionless Berlekamp-Massey (riBM) algorithm [27, 47] for the key equation solver. This algorithm calculates the next discrepancy and error-location polynomial at the same time, eliminating the feedback loop of the primitive BM algorithm.
Since the discrepancy used to control the MUXes does not need to be computed within the current cycle, the critical path is correspondingly shortened. In addition, the hardware-expensive inversions of BM are replaced by multiplications, resulting in simpler circuitry. The riBM algorithm is given in algorithm 2.

Algorithm 2: The look-ahead riBM algorithm for binary BCH codes
  input: Si (1 ≤ i ≤ 2t)
  initialization: Λ^(−1)(x) = 1, B^(−1)(x) = x^−1, k^(−1) = −1, γ^(−1) = 1,
                  ∆^(−1)(x) = S1x + S2x^2 + · · · + S2t·x^2t,
                  Θ^(−1)(x) = S1 + S2x + · · · + S2t·x^(2t−1)
  begin:
  for r = −1, 1, 3, ..., 2t − 3
      Λ^(r+2)(x) = γ^(r)·Λ^(r)(x) + ∆1^(r)·x^2·B^(r)(x)
      ∆^(r+2)(x) = γ^(r)·∆^(r)(x)/x^2 + ∆1^(r)·Θ^(r)(x)
      if ∆1^(r) ≠ 0 and k^(r) ≥ −1 then
          B^(r+2)(x) = Λ^(r)(x)
          Θ^(r+2)(x) = ∆^(r)(x)/x^2
          γ^(r+2) = ∆1^(r)
          k^(r+2) = −k^(r) − 2
      else
          B^(r+2)(x) = x^2·B^(r)(x)
          Θ^(r+2)(x) = Θ^(r)(x)
          γ^(r+2) = γ^(r)
          k^(r+2) = k^(r) + 2
      end if
  output: Λ(x) = Λ^(2t−1)(x) when r = 2t − 1

The input to the key equation solver is the 2t syndromes from the syndrome-calculation component. At the beginning, the syndromes initialize the two register vectors ∆^(−1)(x) and Θ^(−1)(x). The error-location polynomial Λ^(r+2)(x) and the discrepancy polynomial ∆^(r+2)(x) are then calculated from the initialization values. The update of Λ^(r+2)(x) requires 2t + 2 GF multipliers and t + 1 adders, and the update of ∆^(r+2)(x) needs 4t GF multipliers and 2t adders; hence the total numbers of required multipliers and adders are 6t + 2 and 3t + 1, respectively. According to the current discrepancy ∆1^(r) and the current k^(r), MUXes control the updating of the registers for B^(r+2)(x), Θ^(r+2)(x), γ^(r+2), and k^(r+2). The multiplication and division by x^2 are achieved by signal indexing. An additional control counter governs the operation of the above combinational logic. Like the designs in [27, 47], this implementation follows an engine-by-engine structure.
The process engines computing Λ^(r+2)(x) and ∆^(r+2)(x) are almost identical except for their initial data. One computation unit for Λ^(r+2)(x) is shown in figure 4.13; to keep the figure clean, the control unit is omitted. The computation of one Λ^(r+2)(x) coefficient needs two multipliers and one adder. At the first iteration, the registers load the initial data through the MUXes, and the multipliers and adder compute Λ^(r+2)(x); the MUXes then switch to the feedback loop, transmitting Λ^(r+2)(x) for the subsequent computation.

Figure 4.13: Process engine (PE0) for Λ^(r+2)(x) and B^(r+2)(x)
Figure 4.14: Process engine (PE1) for ∆^(r+2)(x) and Θ^(r+2)(x)

The value of B^(r+2)(x) is updated based on the values of ∆1^(r) and k^(r) and is then fed back to prepare for the next computation. Because the maximum length of Λ^(r+2)(x) is t + 1, t + 1 PE0 processors are needed. The theoretical critical path of one PE0 is

TPE0 = Tmult + Tadd + TMUX, (4.22)

which is one TMUX longer than equation 4.20 due to the inclusion of the input-assignment MUXes. The process engine one (PE1), used to compute ∆^(r+2)(x) and Θ^(r+2)(x), is similar to PE0, with the same critical path:

TPE1 = Tmult + Tadd + TMUX. (4.23)

However, since the maximum length of ∆^(r+2)(x) and Θ^(r+2)(x) is 2t, 2t PE1 processors are needed. Finally, t + 1 PE0 and 2t PE1 engines are connected according to their index (see figure 4.15); the control signals of the MUXes are excluded for clarity.

Figure 4.15: The connection of the PE0 and PE1 processors based on the index

Thus the critical path is

TriBM_imp = Tmult + Tadd + TMUX. (4.24)

Considering the input-assignment MUXes, the hardware complexity is

AriBM_imp = (6t + 2) · Amult + (3t + 1) · Aadd + (9t + 3) · AMUX + (6t + 2) · AREG. (4.25)

4.3.4 Parallel Chien Search

Due to the low-latency requirement, the unrolled parallel Chien search structure is used in this design.
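The unrolled evaluation that section 4.3.4 implements in hardware follows equations 2.32 and 2.33: materialize the element array E once, then check every row in parallel. The following Python sketch is under illustrative assumptions (GF(2^4) with p(x) = x^4 + x + 1; the error positions are chosen for the example only):

```python
# GF(2^4) tables, p(x) = x^4 + x + 1.
M, N = 4, 15
EXP, LOG = [0] * N, [0] * (N + 1)
x = 1
for i in range(N):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0b10000:
        x ^= 0b10011

def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % N]

# Element array E (equation 2.32): E[k-1][j-1] = (α^j)^k, k = 1..n, j = 1..t.
def element_array(n, t):
    return [[EXP[(j * k) % N] for j in range(1, t + 1)]
            for k in range(1, n + 1)]

def unrolled_chien(lam, n=N):
    """Equation 2.33: every row of E is checked at once; in hardware this
    costs n·t constant multipliers and n·(t+1) adders, one clock cycle."""
    t = len(lam) - 1
    locations = []
    for k, row in enumerate(element_array(n, t), start=1):
        v = lam[0]
        for j in range(t):
            v ^= gmul(lam[j + 1], row[j])
        if v == 0:                       # α^k is a root -> error at n − k
            locations.append((n - k) % n)
    return sorted(locations)

# Errors at positions 2 and 9: Λ(x) = (1 + α^2·x)(1 + α^9·x).
X1, X2 = EXP[2], EXP[9]
print(unrolled_chien([1, X1 ^ X2, gmul(X1, X2)]))   # -> [2, 9]
```

The row loop here is sequential only for readability; the hardware evaluates all n rows of E simultaneously, which is what collapses the search to a single cycle.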
First, the n-by-t element array E is generated by the FEC simulator based on the error-correction ability and the block length. In contrast to the multipliers in the riBM processor, the multipliers employed in the Chien search multiply by constant elements, so they can be simplified to XOR gates only. A constant-coefficient multiplier has a critical path and area usage of

Tconst = ⌈log2(m)⌉ · Txor, (4.26)
Aconst = (m^2/2 − m) · Axor, (4.27)

from [27, 48]. For the computation of each row, t constant multipliers and t adders are employed. Each bit of the error-correction vector is initialized to zero and flipped to logic one according to the error locations. The first row of a Chien search component is shown in figure 4.16, where x represents the received codeword, c is the correction pattern, and dc is the decoded codeword.

Figure 4.16: One example structure for the first row of the Chien search component with t = 3

From figure 4.16, one row of the Chien search computation and error correction involves t constant multipliers, t adders, one MUX, one XOR gate, and four m-bit registers. The complexity of the Chien search component is then

TCS = Tconst + ⌈log2(t)⌉ · Tadd + Tmux + Txor, (4.28)
ACS = nt · Aconst + nt · Aadd + n · Amux + n · Axor. (4.29)

4.4 System Architecture

To attain a high-throughput, low-power design, it is crucial to establish a well-considered system architecture that prioritizes low latency and low power consumption, especially dynamic power. This section provides an analysis of three different system architectures: direct iteration (DI), parallel iteration (PI), and unrolled pipeline (UP). Only two architectures, PI and UP, are implemented and synthesized, with clock gating to decrease the dynamic power.
First, to comprehensively understand the timing and hardware complexity of those three architectures, the complexities are evaluated down to the logic-gate level, using the formulas from tables 4.1 and 4.2 and the complexities of the different components. It should be noted that the evaluation below uses the average complexity (see equations 4.12 and 4.13) of the syndrome calculation.

4.4.1 Direct Iteration

The most hardware-efficient architecture uses a single riBM core that processes the data iteratively. However, this architecture results in a low throughput, since the riBM component requires t iterations for each error-locator-polynomial calculation.

Figure 4.17: The direct iteration (DI) structure

Registers are employed at the input and output and are inserted between the components to allow a higher clock speed (see figure 4.17). This system can only process one received codeword per t clock cycles, significantly decreasing the throughput. To achieve a 56 Gbit/s throughput, this architecture needs a clock frequency of

fclk = t · 56/k GHz, (4.30)

where k is the message length. The hardware and timing complexities of the different components are shown in figures 4.18 to 4.21. The starting value of m is set at five due to the absence of a t = 4 BCH code below that threshold. From figures 4.18 and 4.20, the critical path is in the Chien search
Figure 4.18: Number of logic gates in the critical path for a DI structure with t = 3

Figure 4.19: Gate count, including the XOR, AND, MUX, and REG gates, for a DI structure with t = 3

Figure 4.20: Number of logic gates in the critical path for a DI structure with t = 4

Figure 4.21: Gate count, including the XOR, AND, MUX, and REG gates, for a DI structure with t = 4

component when m is lower than eight; at m = 8, however, the critical path moves into the riBM core due to the complex pentanomial Mastrovito multiplier. In terms of hardware complexity, the Chien search is the most resource-intensive component, and its logic usage tends to escalate exponentially. For m = 5, 6, the riBM component has a larger area than the syndrome component, but once m exceeds six the syndrome component is the larger of the two. Since the total area usage grows exponentially with increasing m, a small m should be used to achieve a low area usage. A small m, however, means a short block length and message length, so a high fclk is needed to maintain the throughput requirement. For example, implementing a BCH(31,16,3) code with the DI structure would need a 10.5 GHz clock to achieve a 56 Gbit/s throughput based on equation 4.30. A circuit with such a high-speed clock consumes a significant amount of power according to the dynamic-power
formula

Pdyn = α · Cchip · VDD² · fclk, (4.31)

where α is the activity factor, Cchip is the switched capacitance, and VDD is the supply voltage. Therefore, although the DI structure is hardware-efficient, it suffers from high latency and high power consumption.

4.4.2 Parallel Iteration

The iterative structure of the DI implementation limits the achievable throughput. To reduce the latency, incorporating more riBM cores in parallel can be an effective approach. Therefore, a parallel iteration (PI) architecture is proposed with three riBM cores in parallel (see figure 4.22). To enable core selection and ensure accurate output from the respective core, an additional de-multiplexer (DEMUX) and multiplexer (MUX) are integrated into the circuit. The states of the DEMUX and MUX are controlled by the state controller, which stores the current positions and busy flags of those two selectors. To synchronize the arrival of the received codeword and its corresponding error locator polynomial, a first-in-first-out (FIFO) buffer is added: the received codeword is delayed by t + 1 cycles, based on the error-correction ability of the design.

Figure 4.22: The structure of the parallel iteration (PI) system with t = 3

By parallelizing three riBM cores, the delay caused by the core is reduced to only one clock cycle. The required clock frequency for a 56 Gbit/s throughput therefore reduces to

fclk = 56/k GHz. (4.32)

Unfortunately, the integration of more riBM cores leads to a more than threefold increase in area usage compared with DI because of the required control logic. The state controller controls the dispatch of each syndrome. In addition, clock gating is employed to reduce the power consumption: enable signals are added to the registers of the riBM cores and of the Chien search.
When there are no errors in the received codeword, the syndrome is zero, and the state controller disables the riBM cores and the Chien search core according to the busy states of the riBM cores.

Figure 4.23: Number of logic gates in the critical path for a PI structure with t = 3

Figure 4.24: Gate count for a PI structure with t = 3

Figure 4.25: Number of logic gates in the critical path for a PI structure with t = 4

Figure 4.26: Gate count for a PI structure with t = 4

If the syndrome is nonzero, which means errors are present in the received codeword, the state controller activates one non-busy core and also enables the Chien search core to find the error locations and correct the errors. A MUX is added between the Chien search core and the final output so that the Chien search core can be skipped when it is disabled; hence the critical path of the modified Chien search core increases by one MUX. The complexity, excluding the FIFO and state controller, is shown in figures 4.23 to 4.26. The Chien search component has the longest critical path except at m = 8 due to the one extra MUX; for m = 8, the riBM core dominates because of the pentanomial Mastrovito multiplier. Regarding hardware complexity, the riBM cores exhibit the largest area usage for m = 5, 6, while for other values of m the Chien search dominates. A larger value of m results in larger area usage and may result in higher power consumption.
Therefore, m = 6 BCH decoders are implemented in this design.

4.4.3 Unrolled Pipeline

Even though PI reduces the latency caused by the riBM core, integrating the DEMUX and MUX and parallelizing the cores increase the hardware complexity and may lead to higher power consumption, which conflicts with the low-power goal. To constrain power consumption, a power- and hardware-efficient unrolled pipeline (UP) architecture is proposed. Unrolling is possible for the main iteration in algorithm 2, since it follows a regular pattern and does not involve any complex dependencies. Unrolling also helps to reduce the hardware complexity compared with the PI architecture. From algorithm 2, since Λ(−1)(x) = 1, γ(−1) = 1, and B(−1)(x) = x⁻¹, the first iteration can be simplified to algorithm 3.

Algorithm 3: The simplified first iteration of the look-ahead riBM algorithm
input: Si (1 ≤ i ≤ 2t)
initialization: Λ(−1)(x) = 1, B(−1)(x) = x⁻¹, k(−1) = −1, γ(−1) = 1,
∆(−1)(x) = S1 x + S2 x² + … + S2t x^(2t),
Θ(−1)(x) = S1 + S2 x + … + S2t x^(2t−1)
begin: for r = −1
  Λ(r+2)(x) = 1 + ∆1(r) x
  ∆(r+2)(x) = ∆(r)(x)/x² + ∆1(r) Θ(r)(x)
  if ∆1(r) ≠ 0 and k(r) ≥ −1 then
    B(r+2)(x) = 1
    Θ(r+2)(x) = ∆(r)(x)/x²
    γ(r+2) = ∆1(r)
    k(r+2) = −1
  else
    B(r+2)(x) = x
    Θ(r+2)(x) = Θ(r)(x)
    γ(r+2) = 1
    k(r+2) = 1
  end if
output: Λ(r+2)(x), ∆(r+2)(x), B(r+2)(x), Θ(r+2)(x), γ(r+2), k(r+2)

The GF multipliers and adders used to compute Λ(r+2)(x) can be replaced by direct assignments of the constant one and ∆1(r), which is in fact S1. Moreover, t + 1 GF multipliers are removed from the calculation circuit of ∆(r+2)(x). The MUXs used to update B(r+2)(x) can be reduced to two, because only the first two elements of B(r+2)(x) are needed for the next iteration. With those optimizations, the hardware complexity of the first riBM core reduces to

AriBM_first = 2t·Amult + 2t·Aadd + (2t + 3)·AMUX + (4t + 2)·AREG.
(4.33)

Given that no additional logic is incorporated into the core, the timing complexity remains the same as equation 4.20. The last riBM core can also be optimized (see algorithm 4), since only the error locator polynomial is needed.

Algorithm 4: The simplified last iteration of the look-ahead riBM algorithm
input: Λ(r)(x), B(r)(x), γ(r)
begin: for r = 2t − 3
  Λ(r+2)(x) = γ(r) Λ(r)(x) + ∆1(r) x² B(r)(x)
output: Λ(x) = Λ(2t−1)(x)

Then the hardware complexity of the last riBM core reduces to

AriBM_last = 2t·Amult + 2t·Aadd + 2t·AREG. (4.34)

Unfortunately, the cores located between the first and last positions cannot be significantly simplified, since they require the calculation of all parameters. For the riBM cores in between, the complexity is

AriBM_btn = (6t + 2)·Amult + (3t + 1)·Aadd + (3t + 1)·AMUX + (6t + 2)·AREG. (4.35)

By pipelining the riBM cores, a UP structure with t = 3 is obtained, as shown in figure 4.27.

Figure 4.27: The structure of the unrolled pipeline (UP) system with t = 3

The clock-gating method is also implemented in this architecture to reduce the total power consumption. As in the PI architecture, a controller sends the enable signals based on the output of the syndrome component. When there are no errors, the controller disables the riBM cores and the Chien search component, and the received codeword, delayed by the FIFO, is connected directly to the output registers. It should be noted that, because each iteration of the unrolled pipeline is implemented by an individual core, the output MUXs in PE0 and PE1 are not needed, and the timing complexity of the riBM cores decreases to

TriBM_UP = Tmult + Tadd. (4.36)

The hardware and timing complexities are shown in figures 4.28 to 4.31. The Chien search component has the longest critical path over the range m = 5 to m = 10 due to the one extra MUX.
Regarding hardware complexity, the riBM cores exhibit the largest relative area usage for m = 5, 6, while for other values of m the Chien search dominates. In this design, BCH codes with m = 6 and t = 3, 4 are implemented.

Figure 4.28: Number of logic gates in the critical path for a UP structure with t = 3

Figure 4.29: Gate count for a UP structure with t = 3

Figure 4.30: Number of logic gates in the critical path for a UP structure with t = 4

Figure 4.31: Gate count for a UP structure with t = 4

4.4.4 State of the Art

Figures 4.32 to 4.35 compare the complexities of the different system architectures with respect to the key-equation-solver components. The DI architecture has the shortest delay for m = 5−7 and m = 9, the UP architecture has the shortest delay at m = 8, and at m = 10 the three architectures have the same delay. Regarding the hardware complexity of the riBM cores, the DI architecture has the lowest complexity because only one core is used, but UP has a lower gate count than the PI architecture because of the optimization of the first and last riBM cores. For t = 3, the gate count of the UP riBM cores is around half of PI's; for t = 4 the gap is less significant, because one more riBM core is added between the first and last cores. It should be noted that the area usage is inferred from gate counts and therefore only indicates the relative relationship between the various structures.
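The architectural trade-off can also be phrased in clock-rate terms. A small numerical sketch of the throughput relations in equations 4.30 and 4.32 together with the dynamic-power relation of equation 4.31 (the activity factor and capacitance values below are arbitrary placeholders; only the ratios are meaningful):

```python
# Sketch of equations 4.30-4.32 and 4.31: DI needs a t-times-higher clock
# than PI/UP for the same throughput, and dynamic power scales linearly
# with that clock. alpha and C_chip values are placeholders.

def fclk_di_ghz(t, k, throughput_gbps=56):
    # Eq. 4.30: DI accepts one k-bit message every t cycles.
    return t * throughput_gbps / k

def fclk_parallel_ghz(k, throughput_gbps=56):
    # Eq. 4.32: PI/UP accept one message per cycle.
    return throughput_gbps / k

def p_dyn(alpha, c_chip, vdd, fclk):
    # Eq. 4.31: P_dyn = alpha * C_chip * V_DD^2 * f_clk.
    return alpha * c_chip * vdd ** 2 * fclk

# BCH(31,16,3): DI needs 3 * 56/16 = 10.5 GHz, PI/UP only 3.5 GHz, so at
# equal alpha, C_chip and V_DD the DI clocking costs 3x the dynamic power.
ratio = p_dyn(0.1, 1e-12, 0.8, fclk_di_ghz(3, 16)) / \
        p_dyn(0.1, 1e-12, 0.8, fclk_parallel_ghz(16))
```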
In this design, the BCH decoders with m = 6 and t = 3, 4 are implemented.

Figure 4.32: The comparison of the logic-gate count in the critical path for t = 3

Figure 4.33: The comparison of the gate counts of the riBM cores for t = 3

Figure 4.34: The comparison of the logic-gate count in the critical path for t = 4

Figure 4.35: The comparison of the gate counts of the riBM cores for t = 4

5 Results

This chapter begins with the synthesis results of each individual component for t = 3, 4; based on those components, the synthesis results of the parallel iteration (PI) and unrolled pipeline (UP) systems are analyzed. Then a uniform-random-generator testbench is proposed to evaluate the performance of the post-synthesis circuit. Finally, the power of the complete system including FEC is assessed and compared with the system without FEC. All results are synthesized in a 22-nm CMOS library using Cadence Genus. For the individual components, the power estimation is derived from vectorless simulation, a valuable technique for analyzing circuit behavior and performance during the early stages of circuit design. Unlike test-vector-based simulation, vectorless simulation does not require specific input test vectors, simplifying the process and allowing a quicker assessment of metrics such as area usage and performance.
However, vectorless power estimation may not provide precise or realistic results for the whole system. Therefore, for the whole-system evaluation, the power estimation is based on test-vector-based simulation, which requires user-provided input test vectors with a specific input BER.

5.1 GF Multiplier

To evaluate the performance of the Mastrovito multiplier from [41], m = 5−10 multipliers are synthesized with varied timing constraints, spanning from 500 MHz to 5 GHz with a clock-period step of 100 ps. Since the synthesized area usage is almost independent of the timing constraint, only the area usage at 1 GHz is presented. In addition to the current multiplier, the synthesis results of the modular-reduction multiplier [42], used in a previous design [8], are also provided for comparison (see figure 5.1). Regarding timing (not presented here), the Mastrovito multipliers perform almost the same as the modular-reduction ones: a difference of one or two logic-gate delays cannot lead to considerably different timing results given the synthesis timing step of 100 ps. All multipliers can be synthesized successfully up to 5 GHz; however, for a small area usage, the synthesis clock frequency should be kept below 5 GHz. Regarding area usage, the Mastrovito multipliers again perform similarly to the modular-reduction multipliers, except at m = 8: the m = 8 multipliers use a pentanomial irreducible polynomial, and there the modular-reduction
Figure 5.1: The area usage of the Mastrovito and modular-reduction multipliers

Figure 5.2: The power consumption of the Mastrovito and modular-reduction multipliers

multipliers may perform better than the Mastrovito multiplier in terms of area usage. The synthesis result is almost equivalent to the result of the evaluation model in figures 4.5 and 4.6, which predicted that the Mastrovito multipliers have the same area usage as the modular-reduction ones. The power results are presented in figure 5.2. Interestingly, the Mastrovito multiplier with m = 8 consumes more power than the one with m = 9, again because of the pentanomial polynomial for m = 8. The power consumption otherwise increases with m, as more XOR gates and input pins are required.

Figure 5.3: The area usage of the t = 3, 4 syndrome components

Figure 5.4: The power consumption of the t = 3, 4 syndrome components

5.2 Syndrome Component

Like the GF multipliers, the syndrome components are synthesized from 500 MHz to 5 GHz. The syndrome components with t = 3, 4 and m = 5−10 are synthesized, and their area usage is provided in figure 5.3. As in the prediction model in chapter 4, the synthesized area increases exponentially as m increases. Although a large m leads to high area usage, a design based on a large m provides high throughput. In addition, m has a more significant influence on the area usage than the error-correction ability t, especially when m is large, since the number of pins doubles every time m increases by one.
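Functionally, the syndrome component simply evaluates the received word, read as a polynomial, at the first 2t powers of α. A behavioral sketch, again over an illustrative GF(2^4) field rather than the m = 6 field used in the implemented decoders:

```python
# Behavioral sketch of syndrome calculation: S_j = r(alpha^j) for
# j = 1..2t, over an illustrative GF(2^4) field (x^4 + x + 1).

M, PRIM = 4, 0b10011
N = (1 << M) - 1                       # n = 15

EXP, LOG = [0] * (2 * N), [0] * (N + 1)
v = 1
for i in range(N):
    EXP[i], LOG[v] = v, i
    v <<= 1
    if v & (1 << M):
        v ^= PRIM
for i in range(N, 2 * N):
    EXP[i] = EXP[i - N]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def syndromes(r_bits, t):
    """r_bits[i] is the coefficient of x^i; returns [S_1, ..., S_2t]."""
    out = []
    for j in range(1, 2 * t + 1):
        alpha_j = EXP[j % N]
        s = 0
        for coeff in reversed(r_bits):  # Horner evaluation of r(alpha^j)
            s = gf_mul(s, alpha_j) ^ coeff
        out.append(s)
    return out
```

An error-free codeword yields all-zero syndromes, which is exactly the condition the clock-gating controllers described in chapter 4 use to keep the riBM and Chien search cores disabled.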
The power consumption follows the same trend as the area usage (see figure 5.4): it increases as m increases. Consequently, to reduce area usage and limit power consumption, a design with a small value of m should be implemented.

5.3 Key Equation Solver

The performance of the different key equation solvers based on the riBM algorithm is presented in this section. The synthesis frequency range is identical to that of the syndrome components and GF multipliers, and the equation solvers are synthesized from m = 5 to m = 10. Due to the high latency of the direct iteration (DI) architecture, a key equation solver with that architecture is not synthesized; only the unrolled pipeline (UP) and iteration parallel (IP) architectures are presented. It should be noted that for m = 10, the syndrome component cannot be synthesized at 200 ps due to its long critical path, which is consistent with the result in chapter 4 that the syndrome component dominates the critical path at m = 10.

Figure 5.5: The area usage of the t = 3 riBM components

Figure 5.6: The area usage of the t = 4 riBM components

5.3.1 t = 3 riBM Component

The synthesized area usages of the UP riBM and IP riBM are shown in figure 5.5. The area usage of the IP riBM is around five times larger than that of the UP riBM, an increase of about two times compared with the result of the estimation model. The reasons are that the estimation model does not include the input and output switching MUXs in IP and neglects the influence of the net connections. The estimation is also based on gate count, not on the actual area utilization of each gate.
In addition, the area usage follows the trend of the GF multipliers, since they are the main components of the riBM cores. The IP riBM components fail to synthesize at 200 ps because of the longer critical path caused by the pentanomial multiplier and the output MUXs. The power estimation is based on vectorless simulation and is therefore not very precise (see figures 5.7 and 5.8). In general, however, the power consumption follows the same trend as the area usage, and it decreases when the clock frequency is reduced.

Figure 5.7: The power consumption of the t = 3 UP riBM component

Figure 5.8: The power consumption of the t = 3 IP riBM component

Figure 5.9: The power consumption of the t = 4 UP riBM component

Figure 5.10: The power consumption of the t = 4 IP riBM component

5.3.2 t = 4 riBM Component

The hardware complexity of the t = 4 riBM components is presented in figure 5.6. The ratio of IP over UP decreases to around four because two inner cores are added; this trend matches the result of the estimation model. The estimation model does not include the influence of the input and output MUXs, the net connections, or the actual area usage of the different gates, which is why its ratio is lower than the actual results.
In addition, the area usage follows the trend of the GF multipliers, since they are the main components of the riBM cores. The IP riBM components fail to synthesize at 200 ps because of the long critical path caused by the pentanomial multiplier and the output MUXs. Since the power estimation is based on vectorless simulation, it is not very precise but still gives some indication (see figures 5.9 and 5.10). In general, the power consumption follows the same trend as the area usage, and it decreases when the clock frequency is reduced.

5.4 Chien Search

In the estimation model, the Chien search components are the most hardware-consuming. To verify this result, the Chien search components are synthesized with t = 3, 4 from m = 5 to m = 10. The synthesis results are shown in figures 5.11 and 5.12.

Figure 5.11: The area usage of the Chien search components

Figure 5.12: The power consumption of the Chien search components

Based on figure 5.11, the area usage increases exponentially as m increases, which is almost consistent with the prediction models. Furthermore, the synthesis results support the argument that the Chien search component dominates the hardware complexity for m ≥ 6 at t = 3. For m ≥ 9, the component cannot be synthesized at 200 ps because more XOR gates are required to calculate the roots. The power consumption shown in figure 5.12 also follows the trend of the area usage, and the Chien search component is the most power-expensive component for m ≥ 9. To achieve the low-power goal, the Chien search component should be disabled for received error-free codewords; clock gating is therefore
required to enable and disable the Chien search component and to reduce power consumption.

5.5 BSC Simulation

To limit the timing and hardware complexity, only m = 6 decoders are implemented, based on the iteration parallel (IP) and unrolled pipeline (UP) system architectures. The decoders are simulated using a binary symmetric channel (BSC) model to precisely evaluate the power consumption and output BER. The relationships between input and output BER are shown in figures 5.13 and 5.14. The prediction curves are generated by MATLAB from the theoretical formula for the BCH decoder, and the simulation curves are the resulting output of the post-synthesis netlists.

Figure 5.13: Predicted and simulated curves for the t = 3, m = 6 decoder

Figure 5.14: Predicted and simulated curves for the t = 4, m = 6 decoder

The simulation curves follow the prediction curves, and the simulation even demonstrates a lower output BER than predicted. This improvement can be attributed to the decoder's ability to detect more errors than it can correct: when it detects more errors than its correction ability allows, it avoids introducing additional errors into the noisy codeword, resulting in a lower output BER than the prediction curve. The lift at the end of the simulated curve is caused by an insufficient number of test errors: a low input BER means few errors are present in the received codewords, and the decoder can correct most of them, resulting in a clean output. Thus, depending on the throughput, it may take a long time, on the order of a week, to collect enough errors at the output.
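The shape of the prediction curves follows from bounded-distance decoding over a BSC: a block fails whenever more than t of its n bits are flipped. A minimal sketch of this first-order estimate (it assumes a failed block keeps its j input errors, so it ignores the detect-but-not-correct effect noted above, which is why the measured curves sit slightly below it):

```python
# First-order post-FEC BER estimate for a t-error-correcting code of
# length n over a BSC with crossover probability p: blocks with more than
# t bit errors are assumed to fail and to retain their j errors.

from math import comb

def ber_out(p, n, t):
    return sum(j * comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(t + 1, n + 1)) / n

# For the implemented n = 63 codes, an input BER of 1e-3 comes out at
# roughly 4e-8 for t = 3 and 5e-10 for t = 4.
```

Sweeping p over the simulated range reproduces the slope of the MATLAB prediction curves in figures 5.13 and 5.14.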
The t = 3, m = 6 decoder can reduce an input BER of 10⁻³ by around three orders of magnitude, to 10⁻⁶, relaxing the OMA requirement at the transmitter. VCD-based power simulation is employed to evaluate the practical performance of the decoders: the VCD files are generated for specific input BERs and contain the actual signal activities. Both the IP and UP system architectures are simulated, and the results are presented in figures 5.15 and 5.16.

Figure 5.15: Power consumption of the t = 3, m = 6 IP and UP decoders

Figure 5.16: Power consumption of the t = 4, m = 6 IP and UP decoders

The proposed simplified UP decoders offer a notable power saving of up to 30% compared with the IP decoder at an input BER of 10⁻¹. For lower input BERs, such as 10⁻³, the riBM cores are rarely used, since they can be clock-gated whenever an error-free block is received; the power saving is then less significant, only 20%. Compared with the t = 3 decoder, the power consumption of the t = 4 decoder increases by around 1.5 times due to the more complex circuit.

Figure 5.17: Relative system power consumption after adding FEC for the t = 3 UP decoder

Figure 5.18: Relative system power consumption after adding FEC for the t = 4 UP decoder

5.6 Post-FEC Power Saving

Introducing FEC may help to reduce power consumption in short-haul fiber-optic interconnects.
To evaluate the total power consumption of the whole FEC, the t = 3, 4 encoders are also implemented, and switching activity is recorded and back-annotated to the netlist to obtain a more accurate power-dissipation estimate. Moreover, four transmitters [29, 30, 31, 32] are incorporated to simulate real-world transceiver systems. Figures 5.17 and 5.18 show the relative power consumption of systems with FEC over systems without FEC under a 3.5-dB and a 4-dB relaxation, respectively. For an ultra-low-power transmitter system, the decoder can save a maximum of 29% power compared with the uncoded system; however, the power reduction is modest, and the power consumption even increases at a high input BER with the t = 4 FEC. Introducing FEC into a higher-power transmitter system, such as the 72-mW, 120-mW, and 190-mW systems, results in larger power savings. For example, incorporating the t = 4 FEC can reduce the power dissipation by up to 58% in a 190-mW transmitter system, albeit with a low coding gain; the higher-coding-gain t = 3 FEC can still offer a 52% power-dissipation reduction.

6 Conclusion

In this thesis, low-power BCH FECs are implemented to both reduce the output BER and increase the data rate. Suitable FECs can also offer power relaxation for the whole fiber system. In an FEC, the decoder is typically more complex than the encoder, and its complexity varies with the decoding algorithm. To identify appropriate decoders, an estimation model is proposed to assess timing and area performance based on gate counts; it correctly estimates the relationships between different key-equation-solver algorithms and different system architectures. Based on the estimation results, the riBM algorithm is employed for the key equation solver component, while the UP architecture is chosen for the overall system.
By unrolling and optimizing the iterations, the unrolled pipeline (UP) decoder demonstrates a 20% reduction in power consumption compared to the parallel iteration (PI) decoder. To further reduce the power consumption of the decoders, clock gating is employed to disable the key equation solver and the hardware-hungry Chien search component when error-free codewords are received. BCH decoders with t = 3, 4 and m = 6 are finally implemented in a 22-nm FDSOI CMOS technology to provide sufficient error-correction ability while limiting area usage. A binary symmetric channel (BSC) simulator is implemented to validate that the output BER of the RTL netlist follows the trend of the predicted curve. Four realistic transmitters are considered to evaluate the power relaxation, and the corresponding encoders are also implemented to estimate the overall power consumption of the BCH FEC. The best implementation (UP, m = 6, t = 3) offers a maximum 58% power saving compared with the uncoded system at an input BER of 10⁻³ in a 190-mW transmitter system. Even for the ultra-low-power (20-mW) transmitter, the best implementation still provides a 29% power reduction. By integrating suitable low-power BCH decoders, short-reach optical systems can operate at a power-efficient point, thereby mitigating some of the problems of high-temperature environments. Additionally, the inclusion of FECs enhances the resilience of systems by providing error-correction abilities.

Bibliography

[1] J. A. Tatum, D. Gazula, L. A. Graham, J. K. Guenter, R. H. Johnson, J. King, C. Kocot, G. D. Landry, I. Lyubomirsky, A. N. MacInnes, E. M. Shaw, K. Balemarthy, R. Shubochkin, D. Vaidya, M. Yan, and F. Tang, “VCSEL-based interconnects for current and future data centers,” Journal of Lightwave Technology, vol. 33, no. 4, pp. 727–732, 2015.

[2] D. M. Kuchta, A. V. Rylyakov, C. L. Schow, J. E. Proesel, C. W. Baks, P. Westbergh, J. S. Gustavsson, and A.
Larsson, “A 50 Gb/s NRZ modulated 850 nm VCSEL transmitter operating error free to 90 °C,” Journal of Lightwave Technology, vol. 33, no. 4, pp. 802–810, 2015.

[3] Y. Lee, H. Yoo, I. Yoo, and I.-C. Park, “6.4Gb/s multi-threaded BCH encoder and decoder for multi-channel SSD controllers,” in 2012 IEEE International Solid-State Circuits Conference, 2012, pp. 426–428.

[4] W. Xueqiang, P. Liyang, W. Dong, H. Chaohong, and Z. Runde, “A high-speed two-cell BCH decoder for error correcting in MLC NOR flash memories,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 56, no. 11, pp. 865–869, 2009.

[5] K. Lee, S. Lim, and J. Kim, “Low-cost, low-power and high-throughput BCH decoder for NAND flash memory,” in 2012 IEEE International Symposium on Circuits and Systems (ISCAS), 2012, pp. 413–415.

[6] G. Tzimpragos, C. Kachris, I. B. Djordjevic, M. Cvijetic, D. Soudris, and I. Tomkos, “A survey on FEC codes for 100 G and beyond optical networks,” IEEE Communications Surveys & Tutorials, vol. 18, no. 1,