Digital Audio Interface Jitter
Master's thesis in Electrical Engineering
FREDRIK SINKKONEN

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2024

Digital Audio Interface Jitter
FREDRIK SINKKONEN
© FREDRIK SINKKONEN, 2024.

Supervisor: Morten Fjeld, Department of Computer Science and Engineering
Examiner: Morten Fjeld, Department of Computer Science and Engineering

Master's Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2024

Abstract

Jitter is the short-term deviation of a digital signal from its ideal position in time. Some common issues known to produce jitter in currently used digital audio interface formats were examined, and multiple implementations of a Universal Serial Bus (USB) audio interface were designed with the intention of creating a device free from interface jitter. Using the three standardized clock synchronization mechanisms in the USB protocol for isochronous transmissions and a selection of suitable clock sources, USB audio class devices were created on which jitter measurements were then performed. The results were compared with jitter audibility thresholds from three studies containing listening tests.

While all implementations were functionally acceptable, their jitter results differed. For the two isochronous synchronization modes of USB that require a continuously adjustable clock source on the receiving side of the interface, the jitter issue consists of two parts: periodic adjustments of the clock signal are in themselves a source of jitter, and the way in which an adjustable clock source is constructed is another. The initial core idea was that a USB audio interface using isochronous transfers, coupled with the asynchronous clock synchronization mode and a fixed frequency clock source, would be able to provide an interface in which no additional jitter, on top of the inherent jitter level of the source clock, would be added by the transfer of data over the interface. The two fixed frequency clocks that were used did, however, not perform any better than the best adjustable clock source, and when they were attached to the test system their jitter levels increased even further. Analysis of the jitter measurements points in the direction of asynchronous mode being preferable for the lowest possible jitter levels, but the results are not completely unambiguous, and jitter levels below the lowest recorded hearing thresholds were also achieved with one of the other synchronization modes for isochronous USB transfers.

Keywords: Asynchronous, Audio, Clock, DAC, Digital, Interface, Jitter, PSoC, S/PDIF, USB.
Terms and Abbreviations

This section lists terms and abbreviations that are used throughout the thesis.

ADC Analog-to-digital converter
AES/EBU Digital audio transfer interface standard for professional use created by the Audio Engineering Society (AES) and the European Broadcasting Union (EBU). Sometimes interchangeably called AES3.
API Application programming interface
ASRC Asynchronous sample rate converter
BMC Biphase mark code
CD Compact disc
CDF Cumulative distribution function
CRC Cyclic redundancy check
DAC Digital-to-analog converter
DC Direct current
DMA Direct memory access
DSI Digital system interconnect
FF Fixed frequency
FIFO First in, first out
GUI Graphical user interface
I2C Inter-integrated circuit
I2S Inter-IC sound
IAD Interface association descriptor
IC Integrated circuit
IDE Integrated development environment
IEC International Electrotechnical Commission
IMO Internal main oscillator
LSb Least significant bit
MCLK Master clock
MSb Most significant bit
NRZI Non return to zero invert
PDF Probability density function
PID Packet identifier
PLL Phase-locked loop
ppm Parts-per-million
PSoC Programmable system-on-chip
Red book The compact disc digital audio specification is printed in a book that has a red cover; hence the term "Red book" refers to the regular CD audio format.
S/PDIF Sony/Philips digital interface. Digital audio transfer interface standard based on AES/EBU and intended for consumer audio products.
SCK Serial clock
SD Serial data
SOF Start-of-frame
SWD Serial wire debug
TIE Time interval error
TX Transmit
UART Universal asynchronous receiver/transmitter
UDB Universal digital block
USB Universal Serial Bus
VCO Voltage controlled oscillator
WS Word select
XO Crystal oscillator
XTAL Crystal

1 Introduction

This chapter introduces the topic of the thesis and explains why it was selected. It starts with a brief description of the background and motivation for the thesis subject, followed by the scope and delimitations along with the requirements for the intended hardware and software build. Lastly, an overview of the structure of the rest of the report is provided.

1.1 Background and Motivation

Digital audio data can be created by sampling and quantizing an analog sound wave with an analog-to-digital converter (ADC).
The number of bits used for each sample determines the resolution, i.e. how accurately the amplitude of the analog signal can be represented in digital form, and the sampling frequency, as described by the sampling theorem [1], effectively puts an upper bound on the frequency range that can be sampled and stored digitally by the ADC. To play back the digitally stored audio, the digital audio data can be fed, together with a clock signal that matches the sampling rate, into a digital-to-analog converter (DAC), which converts the digital signal back to an analog one that can be sent to a speaker, usually first passing through an amplifier.

A digital signal is less susceptible to interference than an analog signal, so there is a motive for keeping the audio in the digital domain for as long as possible before eventually converting it to analog form so that it becomes audible. This sometimes means that the digital signal must be transferred between different audio devices, creating a need for a robust transfer protocol and a digital audio interface that keeps the audio data intact during the transfer. Any bit errors introduced into the audio data could severely degrade the sound quality, so they must be avoided. Most digital audio interfaces can be considered reliable in this respect [2], but it is not only the audio data itself that must be preserved in the transfer process; the clock signal, which is sent in parallel with the audio data into the DAC, must also remain unaffected by the transfer from one device to another, as small timing errors in the clock signal, defined as jitter [3], can lead to subtle but still audible effects [4]. This is, however, something which at times has been neglected.

The AES/EBU interface [5], created for professional use by the Audio Engineering Society (AES) and the European Broadcasting Union (EBU), and its equivalent counterpart for consumer audio devices, the Sony/Philips digital interface (S/PDIF) [6], standardized in IEC 60958 [7] by the International Electrotechnical Commission (IEC), both exhibit weaknesses in how the clock signal is handled [2]. The Universal Serial Bus (USB) has over time gained popularity as a dedicated digital audio interface, but its jitter performance for the clock signal depends largely on how the interface is implemented [8]. A separate clock signal is not per se sent for USB audio, but there is still a need to keep the clocks at both sides of the interface synchronized.

Historically, for audio companies recognizing the problem of jitter being introduced into the clock signal by AES/EBU and S/PDIF, the selected course of action has often been to keep using the same digital audio interface design that introduces the jitter in the first place, and then with various methods [9] try to remove or reduce the jitter in the clock signal once the transfer of the digital audio data has been completed. The result of this approach has often been added hardware complexity and increased development and manufacturing costs, while it remains questionable whether the jitter has been removed or reduced to sufficiently low levels. A better approach, it seems, would be to use an audio interface design which does not introduce interface jitter into the clock signal to begin with.
In theory, such an interface can be created utilizing a USB audio device class AudioStreaming interface [10] configured to run in asynchronous synchronization mode [8, 9].

1.2 Scope and Delimitations

The aim of this project is to examine how digital audio interfaces can be implemented and then to build a digital audio interface which does not add jitter to the clock signal in the process of transferring digital audio from one device to another. The interface should also be able to transfer the digital audio data reliably, without bit errors or significant delay in the signal path. This is to be accomplished by implementing a USB audio device class AudioStreaming interface using a Cypress programmable system-on-chip (PSoC) development board and microcontroller. The build requires both hardware and software design in order to produce a testing platform. Jitter measurements are also to be performed and presented, but no listening tests or any other kind of auditory assessment regarding jitter and its impact on audio quality will be made.

1.3 Functional Requirements

The interface should fulfill the following functional requirements:
• Support standard Red Book 2-channel Compact Disc (CD) audio data as input: 16 bits per sample at a sample rate of 44.1 kHz.
• Use the inter-IC sound (I2S) output format for the received audio data so that it can be sent to a DAC to be converted to analog and played back.
• Follow interconnect and interface standards so that the test device can be plugged into an audio source without the need for special ports, cables, drivers or software.

No additional USB audio functions except for the necessary AudioControl and AudioStreaming interfaces and their associated endpoints will be implemented in the USB module.

1.4 Outline

The rest of the thesis is organized as follows. Chapter 2 contains theory about jitter audibility, statistics, and the taxonomy of jitter types. A large part of the chapter is also devoted to a walkthrough of the relevant audio interface formats and their characteristics. In Chapter 3, the common design layout for all implementation modes and the specifics for each of them are described. The clock configuration is examined in detail, while other parts of the system setup are treated in a more general sense. Chapter 4 shows the functional results for the different implementation modes together with measurements of the jitter levels, presented both in numbers and visualized as histograms. The results and their validity are then evaluated. The report ends with Chapter 5, containing a summary of the findings along with sustainability and environmental considerations. A listing of the device descriptors used for the synchronous, asynchronous and adaptive mode USB audio interfaces can be found in Appendix A, the programmed register settings for the external clock generator board are located in Appendix B, and Appendix C contains histograms from the period and cycle-to-cycle jitter measurements.

2 Theory

This chapter starts out by defining what jitter is and the possible causes of its existence in a system. Then follows an introduction to statistical theory and a characterization of the different jitter types. An overview of previous work related to jitter audibility testing and audibility threshold theory is presented, and we also get more acquainted with a selection of commonly used audio interface formats.
Particular attention is paid to some of the possible sources of jitter often associated with the AES/EBU and S/PDIF interfaces. The chapter ends with a look at clock generation with phase-locked loops (PLLs) and fractional dividers.

2.1 Jitter

The following section defines what is meant by jitter, introduces some statistical terminology for jitter distributions, and characterizes the different kinds of jitter that we may encounter. Conducting listening tests of any kind is out of scope for this thesis, but results and observations from auditory assessments made in other studies are presented, and one purely theoretical jitter audibility threshold model is also provided.

2.1.1 Definition of Jitter

Jitter can be defined as the short-term time displacement a digital signal has relative to its ideal position in time [3, 11]. Let us start by viewing an ideal square wave clock signal with 50 percent duty cycle, for which each clock cycle starts at the rising edge of the signal at times 1τ, 2τ, 3τ, etc. If the clock signal is affected by jitter, the rising and falling edges can be offset from their ideal positions in time, as visualized by the shaded areas in Figure 2.1. This offset of a signal compared to an ideal reference point in time is called the time interval error (TIE). Depending on the context, we may choose to compare an examined signal to an ideal reference as we do in Figure 2.1, or, if no reference signal exists, we can instead look at the rising clock edges and compare their time of occurrence from one clock cycle to the next. In Figure 2.2 we see an example of the latter, where the period jitter Pn for clock cycle n is the difference between two consecutive rising edges of the signal. Another commonly used measure that does not require a reference signal is the cycle-to-cycle jitter, denoted by C, which can be calculated by measuring the difference between two consecutive period jitter values as shown in Figure 2.2. Cycle-to-cycle jitter is usually expressed as an absolute value and not in terms of negative numbers.

It is worth noting that jitter covers only the short-term time deviation in a signal; deviation over a longer period of time is instead defined as drift or wander. A typical example is the accumulated time deviation between a reference clock and a second free-running clock that are in sync at the start, but where the jitter in the free-running clock causes it to drift further and further out of sync with the reference as more and more clock cycles go by.

Figure 2.1: Jitter assessment for a clock signal having a reference.

Figure 2.2: Jitter assessment for a solitary clock signal.
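As an illustration of these definitions, the short sketch below derives per-cycle periods and cycle-to-cycle jitter values from a list of measured rising-edge timestamps. It is only a minimal example; the function name and the edge times are made up for demonstration and are not taken from the measurement setup used later in the thesis.

```python
def jitter_from_edges(edge_times):
    """Derive periods and cycle-to-cycle jitter from rising-edge timestamps."""
    # P_n: the time between two consecutive rising edges (cf. Figure 2.2).
    periods = [t1 - t0 for t0, t1 in zip(edge_times, edge_times[1:])]
    # C_n = |P_(n+1) - P_n|: absolute change between consecutive periods.
    c2c = [abs(p1 - p0) for p0, p1 in zip(periods, periods[1:])]
    return periods, c2c

# Illustrative edge times (seconds) for a nominally 1 MHz clock with jitter.
edges = [0.0, 1.002e-6, 1.999e-6, 3.003e-6, 4.001e-6]
periods, c2c = jitter_from_edges(edges)
print(max(periods) - min(periods))  # period peak-to-peak jitter
print(max(c2c))                     # maximum cycle-to-cycle jitter
```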
2.1.2 What Causes Jitter?

Although jitter is seen as a shift in the time domain, it is often caused by a disturbance in the voltage domain. In Figure 2.3 we see how noise of amplitude ∆V can create a difference in the voltage level for a rising signal edge and give rise to jitter of size ∆t, as it makes the signal reach the threshold level of the signal transition at a different point in time than expected.

Figure 2.3: Visualization of how noise in the voltage domain can produce jitter.

There are many possible causes for voltage noise. It can originate from sources external to the signal path. Examples of this are 50 Hz to 60 Hz interference from the fundamental power line frequency, switching power supply noise, and capacitive or inductive crosstalk from other cables or signal paths. Noise can also arise from sources within the signal path. Internal thermal noise caused by electrical components, shot noise appearing due to fluctuations in the flow of electrons or holes in semiconductors, and burst noise and 1/f noise occurring in electrical components due to material imperfections are examples of this. Any kind of variation in voltage level can lead to jitter.

2.1.3 Probability Theory for Jitter Distributions

Now that we know from Section 2.1.1 how to define jitter looking at one clock cycle at a time, we will introduce two terms from probability theory and statistics which will aid us when handling longer series of jitter measurements. The first one is the cumulative distribution function (CDF) [3, 12]. Looking at the transition times of the rising edges of a clock signal that is affected by jitter, we can create a function which indicates the probability of the signal having reached its high state at a certain point in time relative to the ideal transition time of the clock signal. This function is called the CDF, and a theoretical example is displayed in Figure 2.4. Before τ1, a long time ahead of the ideal transition time for each edge, none of the rising edges have reached the high state and the probability of a state transition having happened is zero. As we move past τ1 and closer to the ideal transition time τi for each edge, more and more state transitions start to happen. In our theoretical example in Figure 2.4, the number of state transitions happening before the ideal transition time τi has for simplicity been set equal to the number of state transitions happening after it, and more state transitions also happen close to the ideal transition time τi than further away from it. This does not necessarily need to be true for an actual series of real world jitter measurements, but it gives us a feasible model which we can work with to understand probability theory for jitter distributions. As we move past the ideal transition time τi towards τ2 in our example, fewer and fewer new state transitions happen the further away from τi we get, while the probability of a state transition having happened, the CDF, continues to rise, and it reaches its maximum value when we cross τ2, at which point all rising edge state transitions for the theoretical measurement series have already happened. The CDF is a monotonically increasing function, meaning its value will never decrease; it will either remain constant or increase as the function variable increases, and the CDF goes from 0 to 1 as time increases from τ1 to τ2.

Figure 2.4: Theoretical example of a cumulative distribution function for a clock signal with ideal transition time at τi.
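To show how such a cumulative distribution could be estimated from actual measurements, the sketch below sorts a set of transition times, measured relative to the ideal transition instant, and computes the fraction of transitions that have occurred up to each measured time. This is a minimal illustration with made-up sample values, not data from the thesis measurements.

```python
def empirical_cdf(transition_times):
    """Return (times, cdf): cdf[i] is the fraction of transitions that
    have occurred at or before times[i], rising monotonically to 1."""
    times = sorted(transition_times)
    n = len(times)
    cdf = [(i + 1) / n for i in range(n)]
    return times, cdf

# Transition times in seconds, relative to the ideal instant (0).
samples = [-3e-9, -1e-9, -0.5e-9, 0.0, 0.4e-9, 1.2e-9, 2.5e-9]
for t, p in zip(*empirical_cdf(samples)):
    print(f"{t:+.1e} s  CDF = {p:.2f}")
```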
The probability density function (PDF) [3, 12] is the second term from probability theory and statistics that we will introduce in this section. Let us first consider the probability of a signal transition happening at time τp. The probability of a signal transition happening exactly at an arbitrary point τp in time is zero, as that would require the transition to take place within an infinitely small time span, but if we instead look at a small time bracket from τp − γ to τp + γ as in Figure 2.7, then the probability for a transition to happen within that time range can be expressed. The mathematical relation between the CDF and the PDF is

CDF(t) = \int PDF(t)\,dt    (2.1)

Figure 2.5: The probability density function corresponding to the cumulative distribution function in Figure 2.4.

Figure 2.6: The probability density function from Figure 2.5 divided into time brackets with an arbitrary point in time τp selected.

Figure 2.7: A closeup of the probability density function in Figure 2.6 around the arbitrarily selected point in time τp.

Figure 2.5 displays the PDF corresponding to the CDF in Figure 2.4. From Equation 2.1 we also realize that choosing to look at the PDF for a single point in time τp instead of a time interval τp − γ to τp + γ gives us an integration interval ranging from τp to τp, and the result of \int_{\tau_p}^{\tau_p} PDF(t)\,dt will therefore be 0, so we need to express the probability of a signal transition happening at τp as the probability of it happening during a time interval τp − γ to τp + γ and not at an exact single point in time. When dealing with any real world measurement series, we will often organize our measurements to fit into predefined time brackets, as we have done in Figure 2.6 for the theoretical PDF from our example.

2.1.4 Jitter Types

Jitter is often characterized as belonging to one of two categories, being either random or deterministic. The main difference between the two is that random jitter is unbounded, i.e. the jitter can in theory take on any value, while deterministic jitter is bounded and therefore only has a limited range of values it can assume. Depending on the source of the deterministic jitter and its characteristics, it is often specified further as belonging to one of a number of subcategories of its main jitter type. These subcategories of deterministic jitter are presented along with random jitter in more detail in the following sections. Looking at the plotted PDF for a measurement series may help us identify which jitter type we are dealing with so that we can try to determine its cause. The total jitter at any given moment is the sum of all jitter components that happen to be present at that point in time, and jitter in any real world measurement is likely to be a composite of multiple jitter types of different origins rather than just one single type.

Figure 2.8: Jitter components contributing to total jitter.

2.1.4.1 Random Jitter

The most important properties of random jitter [3, 11] are that the jitter is unbounded and that its PDF can for the majority of cases be represented by a normal distribution [12]:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (2.2)

By setting the mean value µ of the normal distribution to our ideal transition time t = 0, setting the standard deviation σ to 1, and replacing the variable x with ∆t, we can simplify the general expression for the normal distribution in Equation 2.2 to

PDF_{random}(\Delta t) = \frac{e^{-\frac{\Delta t^2}{2}}}{\sqrt{2\pi}}    (2.3)

A graph of the PDF for random jitter with ideal transition time 0 and standard deviation σ = 1 is displayed in Figure 2.9. All the internal types of noise listed in Section 2.1.2 belong to random jitter.

Figure 2.9: Probability density function for random jitter.
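As a hedged illustration of this, drawing simulated Gaussian timing errors and sorting them into the time brackets of Section 2.1.3 should reproduce the bell shape of Equation 2.3. The sketch below does this with an assumed standard deviation of one arbitrary time unit; none of the numbers correspond to the measurements presented later.

```python
import math
import random

def random_jitter_pdf(dt):
    """Evaluate the random-jitter PDF of Equation 2.3 (mu = 0, sigma = 1)."""
    return math.exp(-dt * dt / 2) / math.sqrt(2 * math.pi)

# Simulate Gaussian timing errors and bin them into time brackets.
random.seed(1)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
bracket = 0.5                                 # bracket width, arbitrary units
counts = {}
for s in samples:
    k = math.floor(s / bracket)
    counts[k] = counts.get(k, 0) + 1
for k in sorted(counts):
    centre = (k + 0.5) * bracket
    measured = counts[k] / (len(samples) * bracket)
    print(f"{centre:+.2f}  measured {measured:.3f}  Eq. 2.3 {random_jitter_pdf(centre):.3f}")
```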
2.1.4.2 Periodic Jitter

Periodic jitter [3, 11] is jitter which repeats with a certain time interval. It is, however, totally uncorrelated to any clock or data signal in the system, and the maximum frequency at which the jitter appears must be less than half the data rate in order for the jitter to be considered periodic and not data dependent. Periodic jitter can often be assumed to have a sinusoidal waveform, and for more complex cases the periodic jitter can be decomposed into a discrete Fourier series consisting of multiple sinusoidal waveforms that can be treated separately. The PDF for sinusoidal periodic jitter can be written

PDF_{periodic,sinusoidal}(\Delta t) = \begin{cases} \frac{1}{\pi\sqrt{a^2 - \Delta t^2}} & |\Delta t| \leq a \\ 0 & |\Delta t| > a \end{cases}    (2.4)

and its graphical representation is displayed in Figure 2.10.

Figure 2.10: The probability density function for a sinusoidal periodic jitter distribution.

2.1.4.3 Data Dependent Jitter

Data dependent jitter [2, 3, 11] is, as the name implies, a type of jitter which is dependent on the data pattern that precedes the time at which the jitter manifests itself. There are multiple mechanisms that contribute to this jitter type, and they are all related to the signal level being offset in relation to the threshold level which denotes the signal transition. It can be due to reflections in the signal path caused by an impedance mismatch, or because the signal transition begins from a voltage level lower or higher than expected as the signal has not had time to settle from the previous signal transition. Bandwidth limitations and asymmetrical slew rates may also affect the rise and fall times of the signal. Any reflections on the signal path will die out within a limited amount of time, resulting in just the most recent data pattern having an effect on the signal level and jitter. The PDF for data dependent jitter can be represented by

PDF_{dependent}(\Delta t) = \sum_{j=1}^{N} p_j\,\delta(\Delta t - t_j), \quad \text{where } \sum_{j=1}^{N} p_j = 1    (2.5)

In Equation 2.5, δ(∆t − tj) is the Dirac delta function [13], which has the properties

\delta(x) = \begin{cases} \infty & x = 0 \\ 0 & x \neq 0 \end{cases} \quad \text{and} \quad \int_{-\infty}^{\infty} \delta(x)\,dx = 1    (2.6)

The graphical representation of the PDF for data dependent jitter will typically have just a few discrete vertical asymptotes, which do not necessarily all have the same height, as some data patterns causing the jitter may be more frequent than others.

Figure 2.11: Typical probability density function for data dependent jitter.
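As a rough check of Equation 2.4, the sketch below samples a sinusoidal timing error at uniformly distributed phases and compares the resulting bracket densities with the analytical PDF; the amplitude and bin count are arbitrary choices for illustration only.

```python
import math
import random

def periodic_pdf(dt, a):
    """Sinusoidal periodic-jitter PDF of Equation 2.4 (taken as 0 for |dt| >= a here)."""
    return 1.0 / (math.pi * math.sqrt(a * a - dt * dt)) if abs(dt) < a else 0.0

random.seed(2)
a = 1.0                                     # peak timing error, arbitrary units
samples = [a * math.sin(2 * math.pi * random.random()) for _ in range(200_000)]
bins, width = 20, 2 * a / 20
counts = [0] * bins
for s in samples:
    counts[min(int((s + a) / width), bins - 1)] += 1
for i, c in enumerate(counts):
    centre = -a + (i + 0.5) * width
    print(f"{centre:+.2f}  measured {c / (len(samples) * width):.3f}"
          f"  Eq. 2.4 {periodic_pdf(centre, a):.3f}")
```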
2.1.4.4 Duty Cycle Distortion

The duty cycle defines how much time a digital signal spends in the high state versus how much time it spends in the low state. For an ideal clock signal the ratio would be 50/50, as the signal alternates back and forth between high and low, spending exactly the same amount of time in each state. Deviation from this ideal scheme, whether it is caused by an offset signal amplitude, asymmetry in rise and fall times, or an offset threshold level for the signal transition, is called duty cycle distortion [3, 11]. The PDF for duty cycle distortion will look like the two equally tall peaks in Figure 2.12 if both rise and fall transitions are included, and mathematically the PDF can be expressed as

PDF_{duty}(\Delta t) = \frac{\delta(\Delta t - a)}{2} + \frac{\delta(\Delta t + a)}{2}    (2.7)

where δ(∆t ± a) is the Dirac delta function from Equation 2.6.

Figure 2.12: Probability density function for duty cycle distortion.

2.1.4.5 Bounded Uncorrelated Jitter

Bounded uncorrelated jitter [3, 11] covers any deterministic jitter which does not fit into any of the other three categories of deterministic jitter that have been presented in this chapter. The sources of this type of jitter can be many, and the variety of causes means that this category does not lend itself to any particular generalizations. We will therefore simply use it to categorize any bounded jitter which is not periodic, data dependent or caused by duty cycle distortion.

2.1.5 Audibility of Jitter

An important question that we should ask ourselves is, "How much jitter can be tolerated before it starts to affect the sound quality?" In order to give a proper answer, we would need to ask follow-up questions such as, "What frequency range does the audio affected by jitter have?" and "What type of jitter is the audio signal being affected by?" Studies by Benjamin and Gannon [14] and Ashihara et al. [15] have shown that jitter will be more audible in source material that has more high-frequency content than in material consisting of lower frequencies. This is because the effect of jitter on an audio signal is proportional not only to the amount of timing error in the signal, but also to the overall slope of the curve of the audio signal being affected. As visualized in Figure 2.13, the same amount of timing error ∆t on two sine waves of the same amplitude but different frequencies will produce a bigger amplitude offset a2 on the signal with a steeper slope compared to the amplitude offset a1 on the signal with a more gentle slope, so the distortion is more likely to be audible when playing high frequency audio content.

Figure 2.13: The size of an amplitude error caused by a timing error depends on the slope of the signal.
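Using the maximum slope of a sine wave (which reappears as Equation 2.10 in Section 2.1.5.1), a worst-case amplitude error of roughly 2πfA·∆t can be attributed to a timing error ∆t. The sketch below evaluates this for a few audio frequencies; the 10 ns jitter value is an arbitrary example, not a measured figure.

```python
import math

def worst_case_amplitude_error(freq_hz, amplitude, jitter_s):
    """Approximate worst-case amplitude error caused by a timing error,
    using the maximum sine-wave slope 2*pi*f*A (cf. Figure 2.13)."""
    return 2 * math.pi * freq_hz * amplitude * jitter_s

# The same 10 ns timing error applied to full-scale (A = 1) sine waves:
for f in (1_000, 10_000, 20_000):
    err = worst_case_amplitude_error(f, 1.0, 10e-9)
    print(f"{f / 1000:>4.0f} kHz: amplitude error ≈ {err:.2e} of full scale")
```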
For the question regarding the type of jitter affecting the audio signal, one of the first well known studies on this subject, conducted by Manson [16] back in 1974, found the hearing threshold for sinusoidal jitter to be a little lower than for random jitter. The mentioned studies by Manson [16], Benjamin and Gannon [14] and Ashihara et al. [15] all include listening tests by which their respective jitter audibility thresholds are determined, but none of them express in subjective terms how the test participants experienced jitter to affect the sound, or what made the test subjects pick out the tracks with jitter and distinguish them from the ones without it. We can however note that most selected test tracks contained high frequency source material, as jitter audibility is greater for higher frequencies. Tracks with solitary elements and sparse sound, like a single instrument, were also favored over more complex tracks with a multitude of sound sources, as that also made it easier to detect the added jitter.

Returning to the study from 1974 by Manson [16], it set the limit at which jitter could be heard by less than 5 % of the listening audience to 35 ns for sinusoidal jitter with frequency above 2 kHz, and for random jitter the same limit was determined to be 50 ns. For sinusoidal jitter with frequency lower than 2 kHz, the tolerance threshold proposed by Manson increases linearly as the frequency of the jitter is lowered. Test tracks consisting of piano and glockenspiel were selected to provide the most critical material out of a range of tracks auditioned by experienced listeners. All tests were conducted in one room using the same speaker system, and the test participants were all described as having previous experience of assessing sound quality. During the listening tests, which were carried out one person at a time, the test subject was allowed to control the listening level and was also given a control from which the jitter level in the recording could be adjusted, and another one by which the addition of jitter could be turned off completely. The test subject was then asked to find the threshold level for jitter audibility using the controls available. Jitter was added to the audio signal by passing it through two sample-and-hold units and then reclocking it in the second one by applying a control signal which perturbed the clock signal to simulate either random or sinusoidal jitter depending on the setting. Low-pass filters were also added before and after the described jitter addition circuit to comply with the sampling theorem.

With the study being conducted nearly 50 years ago, tape recordings were used as source material and playback was done in monophonic audio. Surely some advances in both recording and playback technology have been made since, but whether using stereo playback instead of mono would make the jitter audibility threshold limits any lower is debatable. Small variations in timing in the microsecond range between what the left and right ears register can be picked up by the hearing system to provide spatial information [17]. Given that the added jitter does not necessarily affect both channels equally, any disturbance caused by the jitter could possibly be picked up more easily if the audio were played back in stereo. On the other hand, the jitter threshold levels found were well below the microsecond range, and any added complexity in the source material makes it more difficult to distinguish a track with added jitter from the original, in which case adding an extra channel could possibly have made the recorded audibility threshold for jitter even higher.

In 1998, Benjamin and Gannon [14] also conducted a study where they performed listening tests in order to try to determine the jitter audibility threshold. As the audibility of jitter was found to greatly depend on the dynamic variation in the frequency spectrum of the examined audio, a lot of effort was put into finding source material where the effects of jitter would be easy to hear. Based on the criteria of having plenty of frequency content at 1 kHz or above, minimal frequency content between 400 Hz and 1 kHz, long sustain and a low noise floor, this resulted in the majority of test tracks consisting of one note from a single instrument. During the initial phase of creating the listening tests it was discovered that there was a learning effect taking place, where the person being subjected to the jitter audibility testing was, up to a certain degree, able to increase their ability to hear the effects of jitter, thus lowering the jitter audibility threshold in successive tests.
A learning phase was therefore added for all test participants prior to the listening tests used to determine the jitter audibility threshold, in order to let the test subject get familiar with the source material, controls and test procedure, so that the threshold value would not decrease while the real tests were being carried out. Any intended test participants who had severe difficulties distinguishing the distortion caused by jitter during the training phase were excluded from the final testing.

After the training phase, testing began with solitary sine wave tones of frequencies 4 kHz, 8 kHz and 20 kHz as source material, to which sinusoidal jitter was then added. The jitter level was at first increased, and the test subjects were asked to indicate when they were able to hear the resulting distortion. Then the jitter level was slowly decreased until the test subject indicated that they were no longer able to hear the distortion caused by the jitter. The process was then repeated a couple more times for all three sine wave frequencies, and the top, bottom and calculated average levels were recorded for each participant. Table 2.1 lists the range of calculated average threshold levels for the test participants.

Audio frequency   Jitter frequency   Jitter audibility threshold
4 kHz             2 kHz              40 ns to 150 ns
8 kHz             5 kHz              5 ns to 25 ns
20 kHz            17 kHz             7 ns to 14 ns

Table 2.1: Range of calculated jitter audibility thresholds for all test participants when playing sine wave tones with added sinusoidal jitter.

In the next part of the listening test, the previously selected audio source material was played back to the test participant. Now given access to a control for the level of sinusoidal jitter added to the source material, as well as the ability to switch at will between the audio signal with added jitter and one without, the test participant was asked to adjust the controls until the threshold level for jitter distortion audibility had been reached. Table 2.2 shows the recorded threshold ranges for the participants for each test track.

Test track                        Jitter frequency   Jitter audibility threshold
1: One note, single instrument    1.70 kHz           50 ns to 270 ns
2: One note, single instrument    1.85 kHz           32.5 ns to 110 ns
3: One note, single instrument    1.70 kHz           20 ns to 310 ns
4: Synthesized music recording    1.53 kHz           112 ns to 370 ns*

*Not all test participants were able to find an audibility threshold for track 4.

Table 2.2: Range of jitter audibility thresholds recorded for all test participants when playing program material with added sinusoidal jitter.

The audibility threshold for the jitter added to the higher frequency sine waves was slightly lower than for any of the more regular program material, and the results for the program material can be considered on par with what Manson [16] found for sinusoidal jitter added to the selected program material in his study. The same audio equipment was used for all the listening tests, and a set of headphones instead of speakers was selected to reproduce the audio recordings in the study. The jitter was added to the source material by running the signal through a jitter modulator to which a function generator was connected, through which the jitter level could be controlled.
Measurements on the audio system used in the listening experiments indicated that the jitter levels inherent in the system itself were well below any of the audibility thresholds recorded during testing and should, according to the authors, not have any influence on the test results.

A third study including jitter audibility testing was conducted by Ashihara et al. [15] in 2005. In it, random jitter was simulated in software by creating new sample values by interpolation, after which the interpolated values were shifted to the ideal sampling points in time. An anti-aliasing filter was also added to make sure the sampling theorem was still satisfied. The test subjects, all people with backgrounds in different audio fields, were asked to audition source material of their own selection using their own audio equipment. Only a computer with a digital audio interface was provided as signal source, and three controls, A, B and X, were given from which the playback of the source material could be controlled. Selecting X always played back the original source material without any added jitter. One of the controls A and B was randomly set to also select the original non-jittered source material, while the other control selected the source material with the added random jitter. The test subject was informed of this setup, asked to listen to the selected source material for a couple of minutes and then, at the end, to decide which one of the controls A and B played back the same version as control X. The test started with plenty of jitter being added to the source material and was run multiple times under the same conditions. The test subject was allowed to proceed to the next step, where the jitter level was halved, once 75 % or more of the attempts were correct. If too many incorrect answers were given, the test was aborted and the last successfully determined jitter level was recorded. Table 2.3 shows the results from the listening test. None of the test subjects were able to audibly distinguish the level following 500 ns, that is 250 ns of random jitter. That the recorded jitter audibility threshold is somewhat higher in this study than in the ones previously presented does not come as a big surprise, as only random jitter was used and, more importantly, the program material in the previous studies was tailored to maximize the audible effects of jitter, while more “normal” music was likely used here since the participants were allowed to pick their own listening material.

Random jitter   Audibility among test participants
2 µs            100 %
1 µs            48 %
500 ns          26 %
250 ns          0 %

Table 2.3: Proportion of test participants that were able to hear the effects of random jitter added to self-selected source material.

In all three studies mentioned so far, listening experiments were used to determine the threshold limits for jitter audibility. The lowest recorded threshold values from any of the studies were in the single digit nanosecond range, noted when sine waves of high frequency were used as program material. Using recordings of solitary real instruments as program material increases the threshold to tens of nanoseconds, and using even more varied and complex source material sends the threshold limit into the range of hundreds of nanoseconds.

2.1.5.1 A Theoretical Jitter Audibility Model

An idea commonly presented is that jitter levels below the quantization noise floor will be inaudible [14, 15, 18].
In Figure 2.14, quantization with low resolution has been used to show an example of an analog signal and its quantized digital counterpart. The horizontal grid lines indicate the available digital levels that the signal can assume and the vertical grid lines mark the sampling interval points. Any real audio application is likely to use a much higher resolution to produce a smoother digitized waveform that more closely resembles the analog signal, but even then there will still be a least significant bit (LSb) size in the digital representation of the audio signal that, together with other system parameters, sets the level of the noise floor. For a DAC with a resolution of N bits, the total number of values that can be represented is $2^N$ and the LSb is $1/2^N$ of the total representable range.

Figure 2.14: Analog sine wave and the resulting waveform after quantization in low resolution.

To find a relation between the system parameters and the quantization noise floor, we can start by considering a sine wave:

$y(t) = A \sin(2\pi f t)$.  (2.8)

The rate of change of the curve is

$\frac{dy(t)}{dt} = 2\pi f A \cos(2\pi f t)$.  (2.9)

At t = 0 the rate of change has its maximum value, as the slope of the sine wave is steepest there. Since $\lim_{t \to 0} \cos(2\pi f t) = 1$, we are left with

$\left(\frac{dy}{dt}\right)_{max} = 2\pi f A$.  (2.10)

If the full range of a converter is used to represent a sine wave of amplitude A, then the interval from 0 up to the highest representable digital level will be of magnitude 2A. Since the LSb is $1/2^N$ of the total representable range, the LSb corresponds to $2A/2^N = A/2^{N-1}$. For any signal, the rate of change multiplied by the amount of time during which the change occurs determines the new level of the signal. The LSb can therefore also be expressed as the rate of change dy/dt multiplied by the amount of time $t_j$ it takes to reach the new level when the rate of change is at its maximum:

$t_j \frac{dy}{dt} = \mathrm{LSb} = \frac{A}{2^{N-1}}$.  (2.11)

Substituting Equation 2.11 into Equation 2.10 gives us

$\frac{A}{t_j 2^{N-1}} = 2\pi f A$.  (2.12)

After rearranging Equation 2.12 we have an expression for $t_j$, which corresponds to the time of one LSb:

$t_j = \frac{1}{2\pi f 2^{N-1}}$.  (2.13)

For any timing jitter to fall below the quantization noise floor, it needs to correspond to less than half an LSb, so the result of Equation 2.13 also has to be divided by a factor of 2. For a 16-bit converter and a maximum signal frequency of 20 kHz, the jitter level should therefore be lower than $t_j/2$ = 121 ps for it to fall below the quantization noise floor and be inaudible.

2.2 AES/EBU and S/PDIF

The physical appearance and the electrical characteristics of the two interfaces differ, as the professionally oriented AES/EBU [5] interface uses balanced XLR connectors while the more consumer oriented S/PDIF [6] uses either a coaxial cable with RCA connectors or an optical fibre with TOSLINK connectors, but beneath the dissimilar exteriors they both use the same data transfer protocol. The audio data and the clock signal are transferred along the same data line, combined into one bit stream using biphase mark code (BMC). Data is transferred in blocks consisting of 192 frames, each frame is divided into multiple subframes, one for each audio channel, and each subframe contains 32 bits of data. Figure 2.15 shows the structure of the transfer scheme, including the placement of the data fields inside a subframe.
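Before the figure and the field listing in Table 2.4 below, a minimal C sketch of how a 32-bit subframe word could be assembled may help make the bit layout concrete. The bit positions follow Table 2.4; the helper function and its conventions (for instance that the parity slot makes bits 4–31 contain an even number of ones, and that the validity bit is 0 for valid audio) are illustrative rather than taken from any particular implementation.

#include <stdint.h>
#include <stdio.h>

/* Bit positions within a 32-bit S/PDIF / AES-EBU subframe word
 * (time slots 0-31, see Table 2.4). The 4 preamble slots are not
 * ordinary data bits and are left out of this sketch. */
enum {
    AUX_SHIFT   = 4,   /* slots 4-7      */
    AUDIO_SHIFT = 8,   /* slots 8-27     */
    V_BIT       = 28,  /* validity       */
    U_BIT       = 29,  /* user data      */
    C_BIT       = 30,  /* channel status */
    P_BIT       = 31   /* parity         */
};

/* Pack a 20-bit audio sample into a subframe word and append the status
 * bits. Parity is chosen so that slots 4-31 contain an even number of
 * ones, which is how the parity slot is commonly described. */
static uint32_t spdif_pack(uint32_t sample20, int valid, int user, int chstat)
{
    uint32_t w = 0;
    w |= (sample20 & 0xFFFFFu) << AUDIO_SHIFT;
    w |= (uint32_t)(!valid)      << V_BIT;   /* V is conventionally 0 for valid audio */
    w |= (uint32_t)(user & 1)    << U_BIT;
    w |= (uint32_t)(chstat & 1)  << C_BIT;

    /* even parity over slots 4-30, stored in slot 31 */
    uint32_t ones = 0;
    for (int b = 4; b <= 30; b++)
        ones += (w >> b) & 1u;
    w |= (ones & 1u) << P_BIT;
    return w;
}

int main(void)
{
    uint32_t w = spdif_pack(0x81234, 1, 0, 0);
    printf("subframe word: 0x%08X\n", w);
    printf("audio sample : 0x%05X\n", (w >> AUDIO_SHIFT) & 0xFFFFFu);
    return 0;
}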
Figure 2.15: Data structure of AES/EBU and S/PDIF. The contents of a) an audio block, b) a frame and c) a subframe.

Bit no.   Subframe field        Usage
0–3       Preamble              Indicates the start of a subframe. The field specifies whether it is a) the first subframe within a frame, b) the first subframe in a block, or c) any other subframe.
4–7       Auxiliary bits        Used to add auxiliary information, or can be assigned to carry an extra 4 bits of audio data, extending the audio sample word size from 20 to 24 bits.
8–27      Audio sample word     The bits used to carry the audio data, sent with the LSb first.
28        Validity bit          Indicates whether the audio sample word contains valid audio data or not.
29        User data bit         Can be used to carry any user defined data.
30        Channel status bit    Used to indicate type of interface, sample rate, copy permission and other settings. The meaning of the bit depends on the frame number within the block in which it is transmitted. All subframes within a frame carry the same channel status bit.
31        Parity bit            Used to detect errors in the transmitted data.

Table 2.4: Description of subframe data fields.

2.2.1 Biphase Mark Code

The clock signal and the audio data, together with all other parts of the subframe, are for the AES/EBU and S/PDIF interfaces sent along the same data line using biphase mark encoding [19], also known as Differential Manchester encoding. For each bit of data that is sent, at least one signal transition is guaranteed to happen. If the data bit is a “0”, the encoded signal sent to the receiver switches polarity once, at the start of the time slot, and if the data bit is a “1”, the signal changes polarity twice, once at the start and once in the middle of the time slot for that bit. The only part of the subframe that is allowed to violate this condition is the first part containing the preamble bits. It has only three valid bit sequences, used for indicating the placement of the subframes in the data stream. The reason for this is to ensure that no other data in the subframe can contain the same bit sequence as the preamble, which would otherwise throw off the synchronization of the data being transmitted. An example of BMC is shown in Figure 2.16.

Figure 2.16: Biphase mark code timing diagram.

2.2.2 Clock Recovery

After data has been transmitted, the receiver needs to recover the clock signal and separate it from the rest of the encoded data, and there are some problems that can arise in the recovery process. A study by Dunn and Hawksford [2], carried out relatively soon after AES/EBU and S/PDIF had come into widespread use, points out many of the known problems with the interfaces’ characteristics, the most significant one being that jitter can depend on the data pattern in the transmission. Due to bandwidth limitations, the rise and fall times of the transmitted signal are limited. A transition from high to low, or vice versa, can therefore begin from a voltage level that depends on the previous data pattern, as the signal has not had time to settle from the previous transition. Dunn and Hawksford [2] created a simulation model of a bandwidth limited transmission channel using a passive low pass filter, which despite its simplicity showed generally good agreement with measurements conducted on real systems.
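To make the mechanism concrete, the sketch below encodes a short bit pattern with biphase mark code, runs the ideal ±1 V waveform through a discretized first-order RC low-pass of the same kind as the channel model in Figure 2.17 and Equation 2.14 below, and prints where the filtered signal crosses 0 V relative to the ideal transition instants. The bit rate, time constant and oversampling factor are illustrative assumptions, not values taken from the referenced study.

#include <stdio.h>
#include <math.h>

#define OSR     64          /* samples per data-bit slot               */
#define NBITS   8
#define UI_NS   325.5       /* one bit slot at ~3.072 Mbit/s, in ns    */
#define TAU_NS  100.0       /* channel time constant RC = 100 ns       */

int main(void)
{
    const int data[NBITS] = {0, 1, 1, 0, 0, 0, 1, 0};
    const double dt    = UI_NS / OSR;            /* time step in ns     */
    const double alpha = dt / (TAU_NS + dt);     /* RC update coefficient */

    double x[NBITS * OSR];   /* ideal BMC waveform */
    double y[NBITS * OSR];   /* filtered waveform  */
    double level = -1.0;

    /* Biphase mark code: always toggle at the start of a bit slot,
     * and toggle again mid-slot if the data bit is a "1". */
    for (int b = 0; b < NBITS; b++) {
        level = -level;
        for (int s = 0; s < OSR; s++) {
            if (data[b] == 1 && s == OSR / 2)
                level = -level;
            x[b * OSR + s] = level;
        }
    }

    /* First-order RC low-pass, a discretized form of Equation 2.14. */
    y[0] = x[0];
    for (int n = 1; n < NBITS * OSR; n++)
        y[n] = y[n - 1] + alpha * (x[n] - y[n - 1]);

    /* Report 0 V crossings of the filtered signal and their offset from
     * the ideal transition instant (multiples of half a bit slot). */
    for (int n = 1; n < NBITS * OSR; n++) {
        if ((y[n - 1] < 0.0) != (y[n] < 0.0)) {
            double frac  = -y[n - 1] / (y[n] - y[n - 1]);
            double t     = (n - 1 + frac) * dt;
            double ideal = floor(t / (UI_NS / 2)) * (UI_NS / 2);
            printf("crossing at %7.1f ns, %5.1f ns after ideal edge\n",
                   t, t - ideal);
        }
    }
    return 0;
}

The varying delay printed for each crossing is exactly the data dependent jitter discussed above: transitions that start from a voltage closer to 0 V cross the threshold sooner than transitions that start from a fully settled level.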
Figure 2.17: First order passive low pass filter used to simulate a bandwidth limited transmission channel.

The voltage $V_{out}$ over the capacitor in Figure 2.17 is

$V_{out} = V_{in}\left(1 - e^{-\frac{t}{RC}}\right) + V_0 e^{-\frac{t}{RC}}$,  (2.14)

where $V_{in}$ is the amplitude of the non-filtered signal and $V_0$ is the voltage level from which the signal transition begins. Using the filter to simulate a bandwidth limited transmission line with time constant τ = RC of 100 ns between a S/PDIF transmitter and receiver gives the graph in Figure 2.18. Looking at just the first eight bits of the subframe in Figure 2.19, we can see more clearly that the starting voltage of each signal transition is different and that it depends on previous signal transitions. The consequence is that the time at which the 0 V threshold level is reached at a signal transition varies with the previous bit pattern, and as we have seen before, a shift in voltage level can create timing jitter. This is shown in Figure 2.20, where bits four and five of the subframe are displayed and the delay between the ideal signal edge and the bandwidth limited signal differs from transition to transition. The time taken to reach the 0 V threshold level is given by

$t = RC \ln\left(1 + \left|\frac{V_0}{V_{in}}\right|\right)$.  (2.15)

Solutions that help lessen this issue with data dependent jitter include using data patterns in the auxiliary bits and user bits that are less prone to creating jitter [2], and locking on to only the first bits of the preamble, creating a local clock from that bit sequence alone instead of using every transition in the whole subframe to generate it [20].

Figure 2.18: Transmission of one subframe over a bandwidth limited channel.

Figure 2.19: First eight bits of the subframe.

Figure 2.20: Bits four and five of the subframe.

2.2.3 Asymmetric Slew Rates

Another cause of jitter in AES/EBU and S/PDIF interfaces is slew rate imbalance, giving asymmetric rise and fall times for the signal. Dunn and Hawksford [2] presented the formula shown in Equation 2.16 for the amount of jitter $t_j$ per signal transition that is created by an asymmetry in the slew rates. $V_d$ is the driving voltage of the transmitter, $V_{SR+}$ is the slew rate in the positive direction and $V_{SR-}$ is the slew rate in the negative direction.

$t_j = \frac{|V_d|}{2}\left|\frac{1}{|V_{SR+}|} - \frac{1}{|V_{SR-}|}\right|$  (2.16)

A visualization of a signal response with symmetric versus asymmetric slew rates is shown in Figure 2.21. The slew rate limited response with symmetric rise and fall times crosses the 0 V threshold level at times τ1 and τ3 instead of at τ0 and τ2 like the ideal square wave does, but since the time difference τ3 − τ1 = τ2 − τ0, this does not present a problem. The falling edge of the slew rate limited response with asymmetric slew rates, on the other hand, crosses the threshold level at τ4 instead of at τ3, as the falling edge only has half the rate of change per time unit compared to the rising edge, and since τ4 − τ1 ≠ τ2 − τ0, the slew rate asymmetry introduces jitter into the signal. One suggested solution is to let the receiver rely on signal transitions in one direction only, effectively removing the need for the slew rates to be matched.
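As a numerical illustration of Equations 2.15 and 2.16, the small C sketch below plugs in made-up but plausible values for the channel time constant, the residual starting voltage of an edge, the drive voltage and the two slew rates; none of the numbers are taken from the referenced study.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Equation 2.15: time to reach the 0 V threshold after a transition. */
    double RC  = 100e-9;   /* channel time constant, s                    */
    double Vin = 1.0;      /* settled signal amplitude, V                 */
    double V0  = 0.4;      /* voltage the edge starts from, V (example)   */
    double t_cross = RC * log(1.0 + fabs(V0 / Vin));
    printf("0 V crossing delay: %.1f ns\n", t_cross * 1e9);

    /* Equation 2.16: jitter per transition due to slew-rate asymmetry. */
    double Vd    = 5.0;    /* driving voltage, V                          */
    double SRpos = 50e6;   /* rising slew rate, V/s (example)             */
    double SRneg = 40e6;   /* falling slew rate, V/s (example)            */
    double t_j = fabs(Vd) / 2.0 * fabs(1.0 / fabs(SRpos) - 1.0 / fabs(SRneg));
    printf("slew-rate jitter  : %.1f ns\n", t_j * 1e9);
    return 0;
}

With these example values the crossing delay comes out at roughly 34 ns and the slew-rate induced jitter at 12.5 ns per transition, both well above the lowest audibility thresholds quoted in Chapter 2.1.5.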
Figure 2.21: Symmetric versus asymmetric slew rate response to an ideal square wave.

2.2.4 Transmission Lines

A model for a transmission line used for high frequency signals [1, 19, 21] is shown in Figure 2.22. The parameters resistance R, inductance L, capacitance C and conductance G are given per unit length of transmission line. The same model can be applied to a S/PDIF interface using a coaxial cable to transfer audio.

Figure 2.22: Cascaded network model for a high frequency transmission line.

The transmission line has a characteristic impedance of

$Z_0 = \sqrt{\frac{R + sL}{G + sC}}$,  (2.17)

with s being the frequency operator, for a sine wave often denoted by jω. Let us now attach a load with impedance $Z_L$ to the transmission line and then apply a voltage pulse of size $V_i$ to the other end of the line, as depicted in Figure 2.23. What happens when the voltage reaches the load depends on the impedance $Z_L$ of the load. For a perfectly matched system, where the transmission line impedance $Z_0$ is equal to the load impedance $Z_L$, the whole voltage pulse $V_i$ continues into the load, but if the impedances differ, a part of the voltage is reflected back along the transmission line. Equation 2.18 gives the reflection coefficient ρ of the system.

$\rho = \frac{V_{reflected}}{V_{incident}} = \frac{Z_L - Z_0}{Z_L + Z_0}$  (2.18)

When the voltage pulse $V_i$ in Figure 2.23 is applied, it starts to move along the transmission line towards the load with the propagation velocity

$v = \frac{c}{\sqrt{\epsilon_r \mu_r}}$,  (2.19)

where c is the speed of light, $\epsilon_r$ is the relative permittivity and $\mu_r$ is the relative permeability of the transmission line. Equation 2.18 gives the ratio between the incident voltage and the reflected voltage. If, for example, the load impedance $Z_L$ and the transmission line impedance $Z_0$ are severely mismatched, with $Z_L$ being twice the size of $Z_0$, then the reflection coefficient ρ is 0.33 and the amplitude of the reflected voltage pulse is one third of the amplitude of the applied voltage $V_i$. For a more tightly matched system, where $Z_0$ and $Z_L$ differ by only 1 %, the amplitude of the reflected voltage pulse at the load end is still around 5 mV per 1 V of applied voltage $V_i$. Figure 2.24 shows the voltage on the transmission line before it has reached the load and after a part of it has been reflected.

Figure 2.23: Transmission line with load attached.

Figure 2.24: High frequency voltage pulse applied to a transmission line. a) Approaching the load. b) Reflected voltage at the load.

The transmitting side of the system, where the voltage pulse $V_i$ originates, also has an impedance of its own and the same reasoning applies to that end of the circuit, so if there is an impedance mismatch between the transmitting side and the transmission line, then the part of the voltage pulse first reflected at the load end will be reflected once again when it reaches the transmitting side. A pulse can therefore be reflected several times between the transmitter and receiver sides if both ends have an impedance different from the transmission line impedance $Z_0$. In reality, any such voltage pulse bouncing back and forth is likely to diminish quickly, as |ρ| < 1 in every case except a completely open or fully shorted circuit end.
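A short sketch of Equation 2.18 with purely resistive impedances reproduces the two example figures above; the 75 Ω line impedance is the nominal value for coaxial S/PDIF, while the load values are simply the two mismatch cases just discussed.

#include <stdio.h>
#include <math.h>

/* Reflection coefficient per Equation 2.18 for purely resistive impedances. */
static double reflection_coeff(double z_load, double z_line)
{
    return (z_load - z_line) / (z_load + z_line);
}

int main(void)
{
    double z0      = 75.0;                    /* nominal S/PDIF line impedance */
    double loads[] = { 150.0, 75.75, 75.0 };  /* 2x mismatch, +1 %, matched    */

    for (int i = 0; i < 3; i++) {
        double rho = reflection_coeff(loads[i], z0);
        printf("ZL = %6.2f ohm: rho = %+.4f, reflected %6.1f mV per 1 V\n",
               loads[i], rho, fabs(rho) * 1000.0);
    }
    return 0;
}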
We know from Chapter 2.1.2 that voltage noise can lead to timing jitter, so even small reflections due to impedance mismatch between a high frequency transmission line and its load, in our case the coaxial cable connecting the transmitter and the receiver and the transmitter and receiver units themselves, can cause issues. Proper impedance matching between the transmitter, the receiver and the cable connecting them is therefore necessary.

S/PDIF and AES/EBU signals transmitted over coaxial cables are also affected by other attributes of the transmission channel apart from the impedance. Dielectric losses and the skin effect, where a high frequency signal travels mainly along the surface of the conductor and only penetrates a short distance into its core, are examples of effects that could be expected to influence the voltage level and rise times of the signal. Both depend on the material parameters of the cable and on the frequency of the signal being transmitted, but as the frequency of the clock signal being sent is expected to stay the same, all signal transmissions should be affected to an equal extent, in which case no new variable jitter would be added to the signal. Optical channels also have their own share of issues that could be expected to affect a signal propagating through the fibre, such as pulse dispersion and limited bandwidth in the transmitter and receiver components, but we will not go any further into whether and how that might impact a digital audio signal being transmitted.

2.2.5 FIFO Buffers

One attempt at solving the interface jitter issues of AES/EBU and S/PDIF has been to insert a first in, first out (FIFO) buffer between the receiver chip and the DAC in the converter and then reclock the data coming out of the FIFO buffer. This approach has been used by some audio manufacturers, but it has some drawbacks. The introduction of a buffer in the audio chain will undoubtedly delay the audio signal. This might be acceptable to a certain degree if the audio is only used for music playback, but if the transmitted audio stems from a video stream, then the audio and video can become noticeably out of sync unless the buffer is small. A delay could also cause problems if the audio system were used for telephone-style communication, as it would make the conversation disjointed and less smooth.

The purpose of the added FIFO is to reclock the signal with a clock that has less jitter than the one arriving from the transmitter, so two clocks, the one supplied by the transmitter feeding the audio data into the FIFO and a second one supplied by the receiver moving data out of the FIFO, will be running freely, not synchronized to each other. The clocks will essentially run at the same rate, but any difference or variation at all in the clock rates will make them drift apart, and this must be remedied by a buffer large enough to accommodate the drift between the clocks so that the FIFO neither underruns nor overflows. If we have a system with a 44.1 kHz sample rate, then one new audio sample arrives every 22.68 µs for each audio channel.
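The next paragraph works through the worst-case arithmetic for this drift; the small C sketch below mirrors that calculation, assuming for illustration that both clocks are rated at ±100 ppm and that the buffer is half filled before playback starts.

#include <stdio.h>

/* Worst-case FIFO drift between two free-running clocks, assuming
 * (illustratively) +/-100 ppm frequency tolerance on both sides and
 * a half-filled buffer at the start of playback. */
int main(void)
{
    double fs    = 44100.0;  /* nominal sample rate, Hz              */
    double ppm   = 100.0;    /* tolerance of each oscillator         */
    double hours = 1.0;      /* required uninterrupted playback time */

    double worst_drift = fs * 2.0 * ppm * 1e-6;         /* samples per second  */
    double drift_total = worst_drift * hours * 3600.0;  /* samples accumulated */
    double buffer      = 2.0 * drift_total;             /* half-full at start  */
    double delay_s     = drift_total / fs;              /* initial latency     */

    printf("worst-case drift  : %.2f samples/s\n", worst_drift);
    printf("drift over %.0f hour: %.0f samples\n", hours, drift_total);
    printf("buffer per channel: %.0f samples\n", buffer);
    printf("startup delay     : %.2f s\n", delay_s);
    return 0;
}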
A normal crystal oscillator (XO), like for example the one used to generate the external clock for our DAC in Figure 3.12, can have a frequency stability rating of ±100 parts-per-million (ppm), which means that the clock could in the worst case be off by one sample in every 10 000 compared to an ideal clock. If we have two clocks with the same frequency stability rating running side by side, where both clocks deviate maximally from the ideal frequency but in opposite directions, then the sample rate could be off by up to 8.82 samples per second. In an hour that amounts to 31 752 samples, so if we fill the FIFO buffer half-way before we start extracting data from it, it would need to be able to fit 63 504 samples for each audio channel to guarantee uninterrupted playback for one whole hour. The delay caused by the FIFO buffer would in that case be 0.72 s at the start of playback. While not ideal, this could be acceptable for audio playback, but in other applications such as video streaming the delay between the video and the audio would simply be too large unless compensated for elsewhere.

Another option that has been tried together with a FIFO buffer is to use an asynchronous sample rate converter (ASRC). The average incoming data rate is first measured and the audio signal is then resampled by the ASRC to match the rate of the clock which extracts the data from the FIFO buffer and hands it over to the DAC. In this way the buffer does not need to be as large, since the ASRC adjusts the audio samples by interpolation so that the average data rate going into the FIFO is the same as the data rate coming out of it, and the buffer will therefore neither overflow nor underrun even though the clock rates at the input and output of the buffer might be slightly different. The use of an ASRC could however give other undesirable audible effects depending on how well it has been implemented, and including not only a FIFO but also an ASRC in the design adds to the complexity of the device.

2.3 Universal Serial Bus

The next sections in this chapter will mainly focus on the parts of USB that are of relevance for the thesis, such as the transfer protocol structure, the device descriptors, the isochronous transfer modes and other audio and timing related subjects. Of the multiple versions of the USB specification, Universal Serial Bus Specification revision 2.0 [22] is the version that the coming sections comply with, simply because it is the most appropriate version considering the hardware used in the construction part of the project. The specification declares several attributes that make USB suitable as a dedicated audio interface; among other things, guaranteed bandwidth and low latency for audio are listed as key points. The implementations in the hardware construction part of the thesis project are done using a “full-speed” device as defined by the USB 2.0 specification. Subsequent revisions of the specification [23, 24] supporting SuperSpeed devices do introduce some new concepts, but they are of no use to us here. Time units are, for example, handled differently. Full backward compatibility with USB 2.0 is however guaranteed for any full- or high-speed device connected to a host port using SuperSpeed. Whenever the USB specification is mentioned going forward, revision 2.0 is intended unless explicitly stated otherwise.
2.3.1 Network Topology

A USB system is controlled by a single host, which polls the bus to which devices are connected. Devices can be grouped into different classes, such as the audio device class. The communication channels between the host and a device are called pipes; they can carry messages or stream data, and they are connected to endpoints at the device and at the host. At a minimum, a device must implement at least one bidirectional message pipe called the default control pipe, which is connected to the control endpoints of the device. Capabilities added to a USB system, for example in the form of an audio interface, are called functions, so from our point of view the terms device and function can be used interchangeably.

The network topology of USB has a tree-like structure. At one end there is the controlling host, at which the root hub resides. Devices or other hubs can be connected to the root hub, and in extension more devices and hubs can be connected to a hub which is connected to the root hub, as displayed in Figure 2.25. In order not to violate the timing specifications, no more than six additional levels following the root hub layer can be connected together.

Figure 2.25: USB topology.

2.3.2 Connecting a Device to the Bus

When a device is connected to the bus, the host needs to discover and configure it before it can perform any function. The process of configuring and enabling the device is called enumeration, and it is done by performing a number of steps through which the device state is altered until configuration has completed. At first, the hub to which the device has been connected will set the device to the powered state and report to the host that its status has changed. This causes a query to be sent from the host to the hub to find out what caused the change. Once the host knows that a device has been attached, it sends a reset command and has the port to which the device is attached set to enabled. After the device has been reset, it goes into the default state, during which the host can communicate with it using the default address. The following steps in the configuration process assign a unique address to the device, causing it to go into the address state, and finally into the configured state once the host has read all the configuration information in the device’s descriptor table and has assigned a configuration value to it. In the configured state, all endpoints described in the device’s descriptor table have been enabled and the device is ready for use.

Figure 2.26: Device state changes during the enumeration process.

2.3.3 Descriptors

A device reports its capabilities and settings to the host upon request through its descriptors. The host can also change some of the device settings by altering the values in the device’s descriptor table through device requests. A device has exactly one main device descriptor that contains general information about the device, and it also lists one or more possible configurations of the device in the underlying configuration descriptors.
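For reference, the standard device descriptor is a fixed 18-byte structure whose field names are defined by the USB 2.0 specification; a C sketch of its layout might look as follows (the packing attribute shown is GCC/Clang specific).

#include <stdint.h>

/* Layout of the standard USB device descriptor (18 bytes); field names
 * follow the USB 2.0 specification. */
struct usb_device_descriptor {
    uint8_t  bLength;            /* size of this descriptor: 18              */
    uint8_t  bDescriptorType;    /* DEVICE descriptor type: 0x01             */
    uint16_t bcdUSB;             /* spec release in BCD, e.g. 0x0200         */
    uint8_t  bDeviceClass;       /* 0x00 means class is defined per interface */
    uint8_t  bDeviceSubClass;
    uint8_t  bDeviceProtocol;
    uint8_t  bMaxPacketSize0;    /* max packet size for endpoint 0           */
    uint16_t idVendor;
    uint16_t idProduct;
    uint16_t bcdDevice;          /* device release number                    */
    uint8_t  iManufacturer;      /* indices into the string descriptors      */
    uint8_t  iProduct;
    uint8_t  iSerialNumber;
    uint8_t  bNumConfigurations; /* number of configuration descriptors      */
} __attribute__((packed));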
A configuration descriptor will in turn list one or more interface descriptors, and each interface descriptor will then list one or more endpoint descriptors. When a configuration descriptor is requested by the host, it is returned by the device accompanied by any underlying interface and endpoint descriptors; interface and endpoint descriptors cannot be requested on their own. Alternative settings for the interfaces may be provided by having multiple configuration descriptors. The default control endpoint is not listed among the endpoint descriptors, as it must be implemented by all USB devices as a control pipe with predefined settings. Endpoint descriptors declare, among other things, the direction of the endpoint, whether it is a control, isochronous, bulk or interrupt endpoint, and what type of synchronization it uses. Endpoints are unidirectional, but two endpoints with the same endpoint number can be created with opposite data directions. A feedback endpoint associated with a single isochronous endpoint is expected to have the same endpoint number as the isochronous endpoint, and if multiple endpoints use the same feedback endpoint, then the endpoint number used for feedback should be the same as that of the isochronous endpoint with the lowest endpoint number associated with it.

A high-speed device can also have a device_qualifier descriptor and an other_speed_configuration descriptor. The device_qualifier descriptor is similar to the device descriptor, but instead of providing information about the device for the current speed setting, it shows device information for the alternative speed setting. Requests for the device_qualifier descriptor will therefore return the full-speed information for a device running in high-speed, and the other way around for a device running in full-speed. In the same way, the other_speed_configuration descriptor returns a configuration descriptor for the alternative speed setting that the device is not currently using. An optional string descriptor can be included to provide information in readable Unicode text format for all devices. Figure 2.27 shows the overall structure of the descriptor table, and example descriptors for the audio device implementations used in the construction build are provided in Appendix A.

Figure 2.27: Configuration, interface and endpoint descriptor structure.

2.3.4 Device Classes and Device Requests

The descriptors on the device are made accessible to the host through replies to device requests sent from the host to the default control pipe of the device. Device requests can be of standard type, which all devices support, or they can be class or vendor specific. Device requests are used either to fetch values from a device descriptor or to manipulate the values in them. The prefixes “GET” and “SET” are used in the request name to indicate whether a request is meant to retrieve or change the descriptor data being referenced. The only standard device requests that do not use the two mentioned prefixes are the CLEAR_FEATURE request, which is used to switch off on-off toggle values, and the SYNCH_FRAME request, which is used to synchronize the host and the device when the size of the transferred data varies within a frame.
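Every device request is delivered in an 8-byte SETUP packet whose field names come from the USB 2.0 specification; the sketch below shows its layout together with a few of the standard request codes.

#include <stdint.h>

/* The 8-byte SETUP packet that starts every device request; field names
 * follow the USB 2.0 specification. */
struct usb_setup_packet {
    uint8_t  bmRequestType;  /* direction, type (standard/class/vendor), recipient */
    uint8_t  bRequest;       /* request code, e.g. GET_DESCRIPTOR                  */
    uint16_t wValue;         /* request specific, e.g. descriptor type and index   */
    uint16_t wIndex;         /* request specific, e.g. interface or endpoint       */
    uint16_t wLength;        /* number of bytes in the following data stage        */
} __attribute__((packed));

/* A few of the standard request codes defined by the specification. */
enum {
    USB_REQ_GET_STATUS        = 0x00,
    USB_REQ_CLEAR_FEATURE     = 0x01,
    USB_REQ_SET_ADDRESS       = 0x05,
    USB_REQ_GET_DESCRIPTOR    = 0x06,
    USB_REQ_SET_CONFIGURATION = 0x09,
    USB_REQ_SYNCH_FRAME       = 0x0C
};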
A device request transaction follows the pattern of a control transfer, with an initiating SETUP packet, an optional data packet depending on the request type, and a closing handshake, as depicted in Figure 2.35 in Chapter 2.3.10.2. If a device receives an invalid or unsupported request, it should respond appropriately and signal the error by setting the packet identifier to STALL, either in the following data packet or in the next status transaction. The USB audio device class is an extension of the USB standard, so a USB audio class device will have a number of extra descriptors containing information about the device’s audio capabilities, and on top of the standard device requests it will also support requests from the USB audio class specification. Software on the host communicating with an audio class device can use the standard audio class driver provided by the operating system, but it is also possible to load an external driver specific to the device and use that instead of the generic driver.

2.3.5 Transfer Types

Most transactions on the bus consist of three interactions:
1) The host sends a “token packet” with parameters to set up a transaction with one of the connected devices.
2) An attempt to transfer the requested data is made, either in the direction from the device to the host or from the host to the device.
3) The receiver of the data sends a “handshake packet” to indicate whether the data was transferred successfully or not.

There are four types of data transfers that can take place:

Data transfer type     Usage
Control transfer       Used for device configuration, commands and status requests.
Bulk transfer          Non-periodic transmission of non-time sensitive data, usually sent in larger chunks.
Interrupt transfer     Transmission of smaller amounts of time sensitive data that must be delivered reliably.
Isochronous transfer   Periodic transmission of real-time data with minimal delay.

Table 2.5: Data transfer types for USB.

We will not make any use of the bulk transfer mode or the interrupt transfer mode in any of the USB audio class devices described in this thesis, and therefore no time will be spent expanding the discussion around those subjects. Isochronous data transfer is the transfer mode used by USB audio devices to move audio data, so it is of most interest to us. Control transfers are also used to some degree by USB audio devices for supportive functionality.

2.3.5.1 Control Transfers

Control transfers allow the host software to configure and control device functions using the default control pipe of the device. Additional message pipes for control transfers used for other device specific purposes can be defined but are not obligatory. Requests to alter the device settings can be standard, device class specific, or vendor specific. Error free message delivery is guaranteed for this transfer type and bus access is granted in a best effort manner. Time is reserved for control transfers on the bus, but that time reservation is shared between all the connected devices and is not limited to a single device.

2.3.5.2 Isochronous Transfers

The characteristics of the isochronous transfer mode make it the most suitable of the USB transfer types for transmission of data like audio, which is consumed in real time. USB guarantees periodic access to the bus for isochronous data transfers, with an upper bound on the maximum allowed latency.
The latency of the transmitted data will depend on the amount of buffering that is done at each stage in the transmission chain. No retransmissions are made for any data lost to transmission errors, but the receiver can still discover that a transmission error has occurred by keeping track of the start-of-frame (SOF) count, the expected delivery interval and the cyclic redundancy check (CRC) field of the packets; for a high-speed high bandwidth device, the packet ID sequencing can also be used. The number of transmission errors is however expected to be low enough not to cause any problems. As a side note, a recommended bit error rate of less than or equal to $10^{-12}$ for a high-speed receiver is mentioned as a design guideline in the electrical characteristics section of the USB specification.

2.3.6 Time Units

For full-speed devices, USB divides time into units of 1 ms called frames. High-speed devices are able to use a narrower time span of 125 µs called a microframe. Each new frame is defined by the host sending out a SOF packet every 1 ms ± 0.5 µs, which devices can use for synchronization. The corresponding generation rate of SOFs for microframes is set to 125 µs ± 0.0625 µs by the USB specification.

2.3.7 Bus Access Period

A device using isochronous transfers must, at the time of being connected to the bus, inform the host software of its desired bus access period so that bandwidth can be allocated to accommodate the required data rate. This is done by setting an appropriate value in the bInterval field of the device’s standard endpoint descriptor. Valid values for isochronous endpoints are between 1–16, and the formula

$I = 2^{bInterval - 1} F$  (2.20)

expresses the desired polling interval. F is the duration of one frame or microframe depending on the speed of the connected device, so for a high-speed device F is 125 µs and for a full-speed device it is 1 ms. For a high-speed high bandwidth device, up to three transactions can take place during one microframe, but the host may not always be able to fulfil the desired access interval of the device. Table 2.6 shows the packet ID sequencing depending on the number of packets sent to a high-speed high bandwidth device during a microframe. By keeping track of the bit sequence in the packet ID field of the received packets, the device can detect if a packet is missing, and if that is the case, then all data sent during that same microframe should be treated as incomplete.

Transactions per microframe   1st packet   2nd packet   3rd packet
1 transaction                 DATA0        –            –
2 transactions                MDATA        DATA1        –
3 transactions                MDATA        MDATA        DATA2

Table 2.6: Packet ID sequencing for a high-speed high bandwidth device receiving isochronous data from the host.

2.3.8 Endpoint Buffering

Before creating, configuring and allocating bandwidth to an isochronous stream pipe for a device, the USB host software will calculate the amount of time that the isochronous transactions are going to take, to make sure that the needs of all devices sharing the bus can be accommodated. When data is sent through an isochronous stream pipe, it is first accumulated in a memory buffer and then transmitted in larger chunks in the form of packets. There must also be a buffer at the endpoint receiving the packets that can hold them until the device is ready to process them.
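As a small numerical sketch of Equation 2.20 together with the buffer sizing rule of Equation 2.21 given just below, the C snippet here uses illustrative values for the sample rate, sample size and bInterval; none of them are tied to the implementations described later in the thesis.

#include <stdio.h>
#include <math.h>

/* Illustrative values only: a full-speed device (1 ms frames), 48 kHz
 * sample rate, 4-byte sample frames (2 channels x 16 bits), bInterval = 1. */
int main(void)
{
    double F         = 1e-3;    /* frame duration for a full-speed device */
    int    bInterval = 1;       /* from the endpoint descriptor, 1..16    */
    double Fs        = 48000.0; /* audio sample rate                      */
    int    S         = 4;       /* bytes per sample frame                 */

    /* Equation 2.20: polling interval I = 2^(bInterval-1) * F */
    double I = (double)(1 << (bInterval - 1)) * F;

    /* Equation 2.21: room for twice the data produced during one polling
     * interval; ceil(Fs * I) here equals ceil((Fs / F_SOF) * I) with the
     * interval counted in frames. */
    double samples = ceil(Fs * I);
    double Bsize   = 2.0 * S * samples;

    printf("polling interval    : %.3f ms\n", I * 1e3);
    printf("endpoint buffer size: %.0f bytes (%.0f samples)\n",
           Bsize, 2.0 * samples);
    return 0;
}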
As a rule of thumb, the recommendation is that the buffers at both endpoints should be large enough to fit twice the amount of data that can be sent during one frame for a full-speed device, or one microframe for a high-speed device. The larger the buffers are, the bigger the latency in the audio chain becomes. An appropriate buffer size $B_{size}$ can be obtained from the formula

$B_{size} = 2S \left\lceil \frac{F_s}{F_{SOF}} I \right\rceil$,  (2.21)

where $F_s$ is the sample rate of the system, $F_{SOF}$ is the frequency of the USB frame clock, I is the polling interval from Equation 2.20 and S is the sample size of the device.

2.3.9 Prebuffering Delay

The way in which USB processes isochronous data through the buffers at each of the endpoints when it is transferred from source to sink inherently adds a delay. At the source, data is accumulated and buffered during frame X, for a full-speed endpoint, until the SOF for frame X+1 is transmitted. The data buffered during frame X is then sent during frame X+1 to the buffer at the sink endpoint. Only when the SOF for frame X+2 appears can the sink start processing the data that was accumulated during frame X at the source. Figure 2.28 displays the buffering delay. The same applies for a high-speed endpoint, but the time unit used is microframes instead of frames.

Figure 2.28: Delay induced due to prebuffering at endpoints.

2.3.10 Transfer of Data

Data is sent on the bus using non return to zero inverted (NRZI) encoding with the LSb first. The polarity of the NRZI encoded signal changes for every data bit that is a “zero” and remains the same for every data bit that is a “one”. Like the biphase mark encoding used by S/PDIF and AES/EBU, NRZI has the benefit of only having a small DC component. A separate signal line for a clock is likewise not needed, as the receiver can recreate the sample clock by itself. Since a long series of data containing nothing but ones produces an NRZI encoded signal with no transitions from high to low or vice versa until the next “zero” in the data appears, extra bits are inserted into the NRZI encoded data to guarantee that a signal transition happens at least every 7th bit. This is enough to ensure that the receiver can lock on to the signal. Any i