Digital Audio Interface Jitter
Master's thesis in Electrical Engineering
FREDRIK SINKKONEN

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2024

Digital Audio Interface Jitter
FREDRIK SINKKONEN
© FREDRIK SINKKONEN, 2024.

Supervisor: Morten Fjeld, Department of Computer Science and Engineering
Examiner: Morten Fjeld, Department of Computer Science and Engineering

Master's Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2024

Abstract

Jitter is the short-term deviation of a digital signal from its ideal position in time. Some common issues known to produce jitter in currently used digital audio interface formats were examined, and multiple implementations of a Universal Serial Bus (USB) audio interface were designed with the intention of creating a device free from interface jitter. Using the three standardized clock synchronization mechanisms in the USB protocol for isochronous transmissions and a selection of suitable clock sources, USB audio class devices were created on which jitter measurements were then performed. The results were compared with jitter audibility thresholds from three studies containing listening tests.

While all implementations were functionally acceptable, their jitter results differed. For the two isochronous synchronization modes of USB that require a continuously adjustable clock source on the receiving side of the interface, the jitter issue consists of two parts: periodic adjustments of the clock signal are in themselves a source of jitter, and the way in which an adjustable clock source is constructed is another. The initial core idea was that a USB audio interface using isochronous transfers, coupled with the asynchronous clock synchronization mode and a fixed frequency clock source, would be able to provide an interface in which no additional jitter, on top of the inherent jitter level of the source clock, would be added by the transfer of data over the interface. The two fixed frequency clocks that were used did, however, not perform any better than the best adjustable clock source, and when they were attached to the test system their jitter levels increased even further. Analysis of the jitter measurements points in the direction of asynchronous mode being preferable for the lowest possible jitter levels, but the results are not completely unambiguous, and jitter levels below the lowest recorded hearing thresholds were also achieved with one of the other synchronization modes for isochronous USB transfers.

Keywords: Asynchronous, Audio, Clock, DAC, Digital, Interface, Jitter, PSoC, S/PDIF, USB.
Terms and Abbreviations

This section lists terms and abbreviations that are used throughout the thesis.

ADC Analog-to-digital converter
AES/EBU Digital audio transfer interface standard for professional use created by the Audio Engineering Society (AES) and the European Broadcasting Union (EBU). Sometimes interchangeably called AES3.
API Application programming interface
ASRC Asynchronous sample rate converter
BMC Biphase mark code
CD Compact disc
CDF Cumulative distribution function
CRC Cyclic redundancy check
DAC Digital-to-analog converter
DC Direct current
DMA Direct memory access
DSI Digital system interconnect
FF Fixed frequency
FIFO First in, first out
GUI Graphical user interface
I2C Inter-integrated circuit
I2S Inter-IC sound
IAD Interface association descriptor
IC Integrated circuit
IDE Integrated development environment
IEC International Electrotechnical Commission
IMO Internal main oscillator
LSb Least significant bit
MCLK Master clock
MSb Most significant bit
NRZI Non return to zero invert
PDF Probability density function
PID Packet identifier
PLL Phase-locked loop
ppm Parts-per-million
PSoC Programmable system-on-chip
Red book The compact disc digital audio specification is printed in a book that has a red cover; hence the term "Red book" refers to the regular CD audio format.
S/PDIF Sony/Philips digital interface. Digital audio transfer interface standard based on AES/EBU and intended for consumer audio products.
SCK Serial clock
SD Serial data
SOF Start-of-frame
SWD Serial wire debug
TIE Time interval error
TX Transmit
UART Universal asynchronous receiver/transmitter
UDB Universal digital block
USB Universal Serial Bus
VCO Voltage controlled oscillator
WS Word select
XO Crystal oscillator
XTAL Crystal

1 Introduction

This chapter introduces the topic of the thesis and explains why it was selected. It starts with a brief description of the background and motivation for the thesis subject, followed by the scope and delimitations along with the requirements for the intended hardware and software build. Lastly, an overview of the structure of the rest of the report is provided.

1.1 Background and Motivation

Digital audio data can be created by sampling and quantizing an analog sound wave with an analog-to-digital converter (ADC).
The number of bits used for each sample determines the resolution, i.e. how accurately the amplitude of the analog signal can be represented in digital form, and the sampling frequency, as described by the sampling theorem [1], effectively puts an upper bound on the frequency range that can be sampled and stored digitally by the ADC. To play back the digitally stored audio, the digital audio data can be fed, together with a clock signal that matches the sampling rate, into a digital-to-analog converter (DAC), which converts the digital signal back to an analog one that can be sent to a speaker, usually first passing through an amplifier.

A digital signal is less susceptible to interference than an analog signal, so there is a motive for keeping the audio in the digital domain for as long as possible before eventually converting it to analog form so that it becomes audible. This sometimes means that the digital signal must be transferred between different audio devices, creating a need for a robust transfer protocol and a digital audio interface that keeps the audio data intact during the transfer. Any bit errors introduced into the audio data could severely degrade the sound quality, so they must be avoided. Most digital audio interfaces can be considered reliable in this respect [2], but it is not only the audio data itself that must be preserved in the transfer process; the clock signal, which is sent in parallel with the audio data into the DAC, must also remain unaffected by the transfer from one device to another, as small timing errors in the clock signal, defined as jitter [3], can lead to subtle but still audible effects [4]. This is, however, something which at times has been neglected.

The AES/EBU interface [5], created for professional use by the Audio Engineering Society (AES) and the European Broadcasting Union (EBU), and its equivalent counterpart for consumer audio devices, the Sony/Philips digital interface (S/PDIF) [6], standardized in IEC 60958 [7] by the International Electrotechnical Commission (IEC), both exhibit weaknesses in how the clock signal is handled [2]. The Universal Serial Bus (USB) has over time gained popularity as a dedicated digital audio interface, but its jitter performance for the clock signal depends largely on how the interface is implemented [8]. A separate clock signal is not per se sent for USB audio, but there is still a need to keep the clocks at both sides of the interface synchronized.

Historically, for audio companies recognizing the problem of jitter being introduced into the clock signal by AES/EBU and S/PDIF, the selected course of action has often been to keep using the same digital audio interface design that introduces the jitter in the first place, and then with various methods [9] try to remove or reduce the jitter in the clock signal once the transfer of the digital audio data has been completed. The result of this approach has often been added hardware complexity and increased development and manufacturing costs, while it remains questionable whether the jitter has been removed or reduced to sufficiently low levels. A better approach, it seems, would be to use an audio interface design which does not introduce interface jitter into the clock signal to begin with.
In theory, such an interface can be created utilizing a USB audio device class AudioStreaming interface [10] configured to run in asynchronous synchronization mode [8, 9].

1.2 Scope and Delimitations

The aim of this project is to examine how digital audio interfaces can be implemented and then to build a digital audio interface which does not add jitter to the clock signal in the process of transferring digital audio from one device to another. The interface should also be able to transfer the digital audio data reliably, without bit errors or significant delay in the signal path. This is to be accomplished by implementing a USB audio device class AudioStreaming interface using a Cypress programmable system-on-chip (PSoC) development board and microcontroller. The build requires both hardware and software design in order to produce a testing platform. Jitter measurements are also to be performed and presented, but no listening tests or any other kind of auditory assessment regarding jitter and its impact on audio quality will be made.

1.3 Functional Requirements

The interface should fulfill the following functional requirements:
• Support standard Red Book 2-channel Compact Disc (CD) audio data as input: 16 bits per sample at a sample rate of 44.1 kHz.
• Use the inter-IC sound (I2S) output format for the received audio data so that it can be sent to a DAC to be converted to analog and played back.
• Follow interconnect and interface standards so that the test device can be plugged into an audio source without the need for special ports, cables, drivers or software.

No additional USB audio functions except for the necessary AudioControl and AudioStreaming interfaces and their associated endpoints will be implemented in the USB module.

1.4 Outline

The rest of the thesis is organized as follows. Chapter 2 contains theory about jitter audibility, statistics, and the taxonomy of jitter types. A large part of the chapter is also devoted to a walkthrough of the relevant audio interface formats and their characteristics. In Chapter 3, the common design layout for all implementation modes and the specifics for each of them are described. The clock configuration is examined in detail, while other parts of the system setup are treated in a more general sense. Chapter 4 shows the functional results for the different implementation modes together with measurements of the jitter levels, presented both in numbers and visualized as histograms. The results and their validity are then evaluated. The report ends with Chapter 5, containing a summary of the findings along with sustainability and environmental considerations. A listing of the device descriptors used for the synchronous, asynchronous and adaptive mode USB audio interfaces can be found in Appendix A, the programmed register settings for the external clock generator board are located in Appendix B, and Appendix C contains histograms from the period and cycle-to-cycle jitter measurements.

2 Theory

This chapter starts out by defining what jitter is and the possible causes of its existence in a system. Then follows an introduction to statistical theory and a characterization of the different jitter types. An overview of previous work related to jitter audibility testing and audibility threshold theory is presented, and we also get more acquainted with a selection of commonly used audio interface formats.
Particular attention is paid to some of the possible sources of jitter often associated with the AES/EBU and S/PDIF interfaces. The chapter ends with a look at clock generation with phase-locked loops (PLLs) and fractional dividers.

2.1 Jitter

The following section defines what is meant by jitter, introduces some statistical terminology for jitter distributions, and characterizes the different kinds of jitter that we may encounter. Conducting listening tests of any kind is out of scope for this thesis, but results and observations from auditory assessments made in other studies are presented, and one purely theoretical jitter audibility threshold model is also provided.

2.1.1 Definition of Jitter

Jitter can be defined as the short-term time displacement a digital signal has relative to its ideal position in time [3, 11]. Let us start by viewing an ideal square wave clock signal with 50 percent duty cycle, for which each clock cycle starts at the rising edge of the signal at times 1τ, 2τ, 3τ, etc. If the clock signal is affected by jitter, the rising and falling edges can be offset from their ideal positions in time, as visualized by the shaded areas in Figure 2.1. This offset of a signal compared to an ideal reference point in time is called the time interval error (TIE). Depending on the context, we may choose to compare an examined signal to an ideal reference as we do in Figure 2.1, or, if no reference signal exists, we can instead look at the rising clock edges and compare their time of occurrence from one clock cycle to the next. In Figure 2.2 we see an example of the latter, where the period jitter Pn for clock cycle n is the difference between two consecutive rising edges of the signal. Another commonly used measure that does not require a reference signal is the cycle-to-cycle jitter, denoted by C, which can be calculated by measuring the difference between two consecutive period jitter values as shown in Figure 2.2. Cycle-to-cycle jitter is usually expressed as an absolute value and not in terms of negative numbers.

It is worth noting that jitter covers only the short-term time deviation in a signal; deviation over a longer period of time is instead defined as drift or wander. A typical example is the accumulated time deviation between a reference clock and a second free-running clock that are in sync at the start, but where the jitter in the free-running clock causes it to drift further and further out of sync with the reference as more and more clock cycles go by.

Figure 2.1: Jitter assessment for a clock signal having a reference.

Figure 2.2: Jitter assessment for a solitary clock signal.
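As an illustration of these definitions, the short sketch below derives per-cycle periods and cycle-to-cycle jitter values from a list of measured rising-edge timestamps. It is only a minimal example; the function name and the edge times are made up for demonstration and are not taken from the measurement setup used later in the thesis.

```python
def jitter_from_edges(edge_times):
    """Derive periods and cycle-to-cycle jitter from rising-edge timestamps."""
    # P_n: the time between two consecutive rising edges (cf. Figure 2.2).
    periods = [t1 - t0 for t0, t1 in zip(edge_times, edge_times[1:])]
    # C_n = |P_(n+1) - P_n|: absolute change between consecutive periods.
    c2c = [abs(p1 - p0) for p0, p1 in zip(periods, periods[1:])]
    return periods, c2c

# Illustrative edge times (seconds) for a nominally 1 MHz clock with jitter.
edges = [0.0, 1.002e-6, 1.999e-6, 3.003e-6, 4.001e-6]
periods, c2c = jitter_from_edges(edges)
print(max(periods) - min(periods))  # period peak-to-peak jitter
print(max(c2c))                     # maximum cycle-to-cycle jitter
```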
2.1.2 What Causes Jitter?

Although jitter is seen as a shift in the time domain, it is often caused by a disturbance in the voltage domain. In Figure 2.3 we see how noise of amplitude ∆V can create a difference in the voltage level for a rising signal edge and give rise to jitter of size ∆t, as it makes the signal reach the threshold level of the signal transition at a different point in time than expected.

Figure 2.3: Visualization of how noise in the voltage domain can produce jitter.

There are many possible causes for voltage noise. It can originate from sources external to the signal path. Examples of this are 50 Hz to 60 Hz interference from the fundamental power line frequency, switching power supply noise, and capacitive or inductive crosstalk from other cables or signal paths. Noise can also arise from sources within the signal path. Internal thermal noise caused by electrical components, shot noise appearing due to fluctuations in the flow of electrons or holes in semiconductors, and burst noise and 1/f noise occurring in electrical components due to material imperfections are examples of this. Any kind of variation in voltage level can lead to jitter.

2.1.3 Probability Theory for Jitter Distributions

Now that we know from Section 2.1.1 how to define jitter looking at one clock cycle at a time, we will introduce two terms from probability theory and statistics which will aid us when handling longer series of jitter measurements. The first one is the cumulative distribution function (CDF) [3, 12]. Looking at the transition times of the rising edges of a clock signal that is affected by jitter, we can create a function which indicates the probability of the signal having reached its high state at a certain point in time relative to the ideal transition time of the clock signal. This function is called the CDF, and a theoretical example is displayed in Figure 2.4. Before τ1, a long time ahead of the ideal transition time for each edge, none of the rising edges have reached the high state and the probability of a state transition having happened is zero. As we move past τ1 and closer to the ideal transition time τi for each edge, more and more state transitions start to happen. In our theoretical example in Figure 2.4, the number of state transitions happening before the ideal transition time τi has for simplicity been set equal to the number of state transitions happening after it, and more state transitions also happen close to the ideal transition time τi than further away from it. This does not necessarily need to be true for an actual series of real world jitter measurements, but it gives us a feasible model which we can work with to understand probability theory for jitter distributions. As we move past the ideal transition time τi towards τ2 in our example, fewer and fewer new state transitions happen the further away from τi we get, while the probability of a state transition having happened, the CDF, continues to rise, and it reaches its maximum value when we cross τ2, at which point all rising edge state transitions for the theoretical measurement series have already happened. The CDF is a monotonically increasing function, meaning its value will never decrease; it will either remain constant or increase as the function variable increases, and the CDF goes from 0 to 1 as time increases from τ1 to τ2.

Figure 2.4: Theoretical example of a cumulative distribution function for a clock signal with ideal transition time at τi.
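To show how such a cumulative distribution could be estimated from actual measurements, the sketch below sorts a set of transition times, measured relative to the ideal transition instant, and computes the fraction of transitions that have occurred up to each measured time. This is a minimal illustration with made-up sample values, not data from the thesis measurements.

```python
def empirical_cdf(transition_times):
    """Return (times, cdf): cdf[i] is the fraction of transitions that
    have occurred at or before times[i], rising monotonically to 1."""
    times = sorted(transition_times)
    n = len(times)
    cdf = [(i + 1) / n for i in range(n)]
    return times, cdf

# Transition times in seconds, relative to the ideal instant (0).
samples = [-3e-9, -1e-9, -0.5e-9, 0.0, 0.4e-9, 1.2e-9, 2.5e-9]
for t, p in zip(*empirical_cdf(samples)):
    print(f"{t:+.1e} s  CDF = {p:.2f}")
```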
The probability density function (PDF) [3, 12] is the second term from probability theory and statistics that we will introduce in this section. Let us first consider the probability of a signal transition happening at time τp. The probability of a signal transition happening exactly at an arbitrary point τp in time is zero, as that would require the transition to take place within an infinitely small time span, but if we instead look at a small time bracket from τp − γ to τp + γ as in Figure 2.7, then the probability for a transition to happen within that time range can be expressed. The mathematical relation between the CDF and the PDF is

CDF(t) = \int PDF(t)\,dt    (2.1)

Figure 2.5: The probability density function corresponding to the cumulative distribution function in Figure 2.4.

Figure 2.6: The probability density function from Figure 2.5 divided into time brackets with an arbitrary point in time τp selected.

Figure 2.7: A closeup of the probability density function in Figure 2.6 around the arbitrarily selected point in time τp.

Figure 2.5 displays the PDF corresponding to the CDF in Figure 2.4. From Equation 2.1 we also realize that choosing to look at the PDF for a single point in time τp instead of a time interval τp − γ to τp + γ gives us an integration interval ranging from τp to τp, and the result of \int_{\tau_p}^{\tau_p} PDF(t)\,dt will therefore be 0, so we need to express the probability of a signal transition happening at τp as the probability of it happening during a time interval τp − γ to τp + γ and not at an exact single point in time. When dealing with any real world measurement series, we will often organize our measurements to fit into predefined time brackets, as we have done in Figure 2.6 for the theoretical PDF from our example.

2.1.4 Jitter Types

Jitter is often characterized as belonging to one of two categories, being either random or deterministic. The main difference between the two is that random jitter is unbounded, i.e. the jitter can in theory take on any value, while deterministic jitter is bounded and therefore only has a limited range of values it can assume. Depending on the source of the deterministic jitter and its characteristics, it is often specified further as belonging to one of a number of subcategories of its main jitter type. These subcategories of deterministic jitter are presented along with random jitter in more detail in the following sections. Looking at the plotted PDF for a measurement series may help us identify which jitter type we are dealing with so that we can try to determine its cause. The total jitter at any given moment is the sum of all jitter components that happen to be present at that point in time, and jitter in any real world measurement is likely to be a composite of multiple jitter types of different origins rather than just one single type.

Figure 2.8: Jitter components contributing to total jitter.

2.1.4.1 Random Jitter

The most important properties of random jitter [3, 11] are that the jitter is unbounded and that its PDF can for the majority of cases be represented by a normal distribution [12]:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (2.2)

By setting the mean value µ of the normal distribution to our ideal transition time t = 0, setting the standard deviation σ to 1, and replacing the variable x with ∆t, we can simplify the general expression for the normal distribution in Equation 2.2 to

PDF_{random}(\Delta t) = \frac{e^{-\frac{\Delta t^2}{2}}}{\sqrt{2\pi}}    (2.3)

A graph of the PDF for random jitter with ideal transition time 0 and standard deviation σ = 1 is displayed in Figure 2.9. All the internal types of noise listed in Section 2.1.2 belong to random jitter.

Figure 2.9: Probability density function for random jitter.
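As a hedged illustration of this, drawing simulated Gaussian timing errors and sorting them into the time brackets of Section 2.1.3 should reproduce the bell shape of Equation 2.3. The sketch below does this with an assumed standard deviation of one arbitrary time unit; none of the numbers correspond to the measurements presented later.

```python
import math
import random

def random_jitter_pdf(dt):
    """Evaluate the random-jitter PDF of Equation 2.3 (mu = 0, sigma = 1)."""
    return math.exp(-dt * dt / 2) / math.sqrt(2 * math.pi)

# Simulate Gaussian timing errors and bin them into time brackets.
random.seed(1)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
bracket = 0.5                                 # bracket width, arbitrary units
counts = {}
for s in samples:
    k = math.floor(s / bracket)
    counts[k] = counts.get(k, 0) + 1
for k in sorted(counts):
    centre = (k + 0.5) * bracket
    measured = counts[k] / (len(samples) * bracket)
    print(f"{centre:+.2f}  measured {measured:.3f}  Eq. 2.3 {random_jitter_pdf(centre):.3f}")
```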
2.1.4.2 Periodic Jitter

Periodic jitter [3, 11] is jitter which repeats with a certain time interval. It is, however, totally uncorrelated to any clock or data signal in the system, and the maximum frequency at which the jitter appears must be less than half the data rate in order for the jitter to be considered periodic and not data dependent. Periodic jitter can often be assumed to have a sinusoidal waveform, and for more complex cases the periodic jitter can be decomposed into a discrete Fourier series consisting of multiple sinusoidal waveforms that can be treated separately. The PDF for sinusoidal periodic jitter can be written

PDF_{periodic,sinusoidal}(\Delta t) = \begin{cases} \frac{1}{\pi\sqrt{a^2 - \Delta t^2}} & |\Delta t| \leq a \\ 0 & |\Delta t| > a \end{cases}    (2.4)

and its graphical representation is displayed in Figure 2.10.

Figure 2.10: The probability density function for a sinusoidal periodic jitter distribution.

2.1.4.3 Data Dependent Jitter

Data dependent jitter [2, 3, 11] is, as the name implies, a type of jitter which is dependent on the data pattern that precedes the time at which the jitter manifests itself. There are multiple mechanisms that contribute to this jitter type, and they are all related to the signal level being offset in relation to the threshold level which denotes the signal transition. It can be due to reflections in the signal path caused by an impedance mismatch, or because the signal transition begins from a voltage level lower or higher than expected as the signal has not had time to settle from the previous signal transition. Bandwidth limitations and asymmetrical slew rates may also affect the rise and fall times of the signal. Any reflections on the signal path will die out within a limited amount of time, resulting in just the most recent data pattern having an effect on the signal level and jitter. The PDF for data dependent jitter can be represented by

PDF_{dependent}(\Delta t) = \sum_{j=1}^{N} p_j\,\delta(\Delta t - t_j), \quad \text{where } \sum_{j=1}^{N} p_j = 1    (2.5)

In Equation 2.5, δ(∆t − tj) is the Dirac delta function [13], which has the properties

\delta(x) = \begin{cases} \infty & x = 0 \\ 0 & x \neq 0 \end{cases} \quad \text{and} \quad \int_{-\infty}^{\infty} \delta(x)\,dx = 1    (2.6)

The graphical representation of the PDF for data dependent jitter will typically have just a few discrete vertical asymptotes, which do not necessarily all have the same height, as some data patterns causing the jitter may be more frequent than others.

Figure 2.11: Typical probability density function for data dependent jitter.
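As a rough check of Equation 2.4, the sketch below samples a sinusoidal timing error at uniformly distributed phases and compares the resulting bracket densities with the analytical PDF; the amplitude and bin count are arbitrary choices for illustration only.

```python
import math
import random

def periodic_pdf(dt, a):
    """Sinusoidal periodic-jitter PDF of Equation 2.4 (taken as 0 for |dt| >= a here)."""
    return 1.0 / (math.pi * math.sqrt(a * a - dt * dt)) if abs(dt) < a else 0.0

random.seed(2)
a = 1.0                                     # peak timing error, arbitrary units
samples = [a * math.sin(2 * math.pi * random.random()) for _ in range(200_000)]
bins, width = 20, 2 * a / 20
counts = [0] * bins
for s in samples:
    counts[min(int((s + a) / width), bins - 1)] += 1
for i, c in enumerate(counts):
    centre = -a + (i + 0.5) * width
    print(f"{centre:+.2f}  measured {c / (len(samples) * width):.3f}"
          f"  Eq. 2.4 {periodic_pdf(centre, a):.3f}")
```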
2.1.4.4 Duty Cycle Distortion

The duty cycle defines how much time a digital signal spends in the high state versus how much time it spends in the low state. For an ideal clock signal the ratio would be 50/50, as the signal alternates back and forth between high and low, spending exactly the same amount of time in each state. Deviation from this ideal scheme, whether it is caused by an offset signal amplitude, asymmetry in rise and fall times, or an offset threshold level for the signal transition, is called duty cycle distortion [3, 11]. The PDF for duty cycle distortion will look like the two equally tall peaks in Figure 2.12 if both rise and fall transitions are included, and mathematically the PDF can be expressed as

PDF_{duty}(\Delta t) = \frac{\delta(\Delta t - a)}{2} + \frac{\delta(\Delta t + a)}{2}    (2.7)

where δ(∆t ± a) is the Dirac delta function from Equation 2.6.

Figure 2.12: Probability density function for duty cycle distortion.

2.1.4.5 Bounded Uncorrelated Jitter

Bounded uncorrelated jitter [3, 11] covers any deterministic jitter which does not fit into any of the other three categories of deterministic jitter that have been presented in this chapter. The sources of this type of jitter can be many, and the variety of causes means that this category does not lend itself to any particular generalizations. We will therefore simply use it to categorize any bounded jitter which is not periodic, data dependent or caused by duty cycle distortion.

2.1.5 Audibility of Jitter

An important question that we should ask ourselves is, "How much jitter can be tolerated before it starts to affect the sound quality?" In order to give a proper answer, we would need to ask follow-up questions such as, "What frequency range does the audio affected by jitter have?" and "What type of jitter is the audio signal being affected by?" Studies by Benjamin and Gannon [14] and Ashihara et al. [15] have shown that jitter will be more audible in source material that has more high-frequency content than in material consisting of lower frequencies. This is because the effect of jitter on an audio signal is proportional not only to the amount of timing error in the signal, but also to the overall slope of the curve of the audio signal being affected. As visualized in Figure 2.13, the same amount of timing error ∆t on two sine waves of the same amplitude but different frequencies will produce a bigger amplitude offset a2 on the signal with a steeper slope compared to the amplitude offset a1 on the signal with a more gentle slope, so the distortion is more likely to be audible when playing high frequency audio content.

Figure 2.13: The size of an amplitude error caused by a timing error depends on the slope of the signal.
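Using the maximum slope of a sine wave (which reappears as Equation 2.10 in Section 2.1.5.1), a worst-case amplitude error of roughly 2πfA·∆t can be attributed to a timing error ∆t. The sketch below evaluates this for a few audio frequencies; the 10 ns jitter value is an arbitrary example, not a measured figure.

```python
import math

def worst_case_amplitude_error(freq_hz, amplitude, jitter_s):
    """Approximate worst-case amplitude error caused by a timing error,
    using the maximum sine-wave slope 2*pi*f*A (cf. Figure 2.13)."""
    return 2 * math.pi * freq_hz * amplitude * jitter_s

# The same 10 ns timing error applied to full-scale (A = 1) sine waves:
for f in (1_000, 10_000, 20_000):
    err = worst_case_amplitude_error(f, 1.0, 10e-9)
    print(f"{f / 1000:>4.0f} kHz: amplitude error ≈ {err:.2e} of full scale")
```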
For the question regarding the type of jitter affecting the audio signal, one of the first well known studies on this subject, conducted by Manson [16] back in 1974, found the hearing threshold for sinusoidal jitter to be a little lower than for random jitter. The mentioned studies by Manson [16], Benjamin and Gannon [14] and Ashihara et al. [15] all include listening tests by which their respective jitter audibility thresholds are determined, but none of them express in subjective terms how the test participants experienced jitter to affect the sound, or what made the test subjects pick out the tracks with jitter and distinguish them from the ones without it. We can however note that most selected test tracks contained high frequency source material, as jitter audibility is greater for higher frequencies. Tracks with solitary elements and sparse sound, like a single instrument, were also favored over more complex tracks with a multitude of sound sources, as that also made it easier to detect the added jitter.

Returning to the study from 1974 by Manson [16], it set the limit at which jitter could be heard by less than 5 % of the listening audience to 35 ns for sinusoidal jitter with frequency above 2 kHz, and for random jitter the same limit was determined to be 50 ns. For sinusoidal jitter with frequency lower than 2 kHz, the tolerance threshold proposed by Manson increases linearly as the frequency of the jitter is lowered. Test tracks consisting of piano and glockenspiel were selected to provide the most critical material out of a range of tracks auditioned by experienced listeners. All tests were conducted in one room using the same speaker system, and the test participants were all described as having previous experience of assessing sound quality. During the listening tests, which were carried out one person at a time, the test subject was allowed to control the listening level and was also given a control from which the jitter level in the recording could be adjusted, and another one by which the addition of jitter could be turned off completely. The test subject was then asked to find the threshold level for jitter audibility using the controls available. Jitter was added to the audio signal by passing it through two sample-and-hold units and then reclocking it in the second one by applying a control signal which perturbed the clock signal to simulate either random or sinusoidal jitter depending on the setting. Low-pass filters were also added before and after the described jitter addition circuit to comply with the sampling theorem.

With the study being conducted nearly 50 years ago, tape recordings were used as source material and playback was done in monophonic audio. Surely some advances in both recording and playback technology have been made since, but whether using stereo playback instead of mono would make the jitter audibility threshold limits any lower is debatable. Small variations in timing in the microsecond range between what the left and right ears register can be picked up by the hearing system to provide spatial information [17]. Given that the added jitter does not necessarily affect both channels equally, any disturbance caused by the jitter could possibly be picked up more easily if the audio were played back in stereo. On the other hand, the jitter threshold levels found were well below the microsecond range, and any added complexity in the source material makes it more difficult to distinguish a track with added jitter from the original, in which case adding an extra channel could possibly have made the recorded audibility threshold for jitter even higher.

In 1998, Benjamin and Gannon [14] also conducted a study where they performed listening tests in order to try to determine the jitter audibility threshold. As the audibility of jitter was found to greatly depend on the dynamic variation in the frequency spectrum of the examined audio, a lot of effort was put into finding source material where the effects of jitter would be easy to hear. Based on the criteria of having plenty of frequency content at 1 kHz or above, minimal frequency content between 400 Hz and 1 kHz, long sustain and a low noise floor, this resulted in the majority of test tracks consisting of one note from a single instrument. During the initial phase of creating the listening tests it was discovered that there was a learning effect taking place, where the person being subjected to the jitter audibility testing was, up to a certain degree, able to increase their ability to hear the effects of jitter, thus lowering the jitter audibility threshold in successive tests.
A learning phase was therefore added for all test participants prior to the listening tests used to determine the jitter audibility threshold, in order to let the test subject get familiar with the source material, controls and test procedure, so that the threshold value would not decrease while the real tests were being carried out. Any intended test participants who had severe difficulties distinguishing the distortion caused by jitter during the training phase were excluded from the final testing.

After the training phase, testing began with solitary sine wave tones of frequencies 4 kHz, 8 kHz and 20 kHz as source material, to which sinusoidal jitter was then added. The jitter level was at first increased, and the test subjects were asked to indicate when they were able to hear the resulting distortion. Then the jitter level was slowly decreased until the test subject indicated that they were no longer able to hear the distortion caused by the jitter. The process was then repeated a couple more times for all three sine wave frequencies, and the top, bottom and calculated average levels were recorded for each participant. Table 2.1 lists the range of calculated average threshold levels for the test participants.

Audio frequency   Jitter frequency   Jitter audibility threshold
4 kHz             2 kHz              40 ns to 150 ns
8 kHz             5 kHz              5 ns to 25 ns
20 kHz            17 kHz             7 ns to 14 ns

Table 2.1: Range of calculated jitter audibility thresholds for all test participants when playing sine wave tones with added sinusoidal jitter.

In the next part of the listening test, the previously selected audio source material was played back to the test participant. Now given access to a control for the level of sinusoidal jitter added to the source material, as well as the ability to switch at will between the audio signal with added jitter and one without, the test participant was asked to adjust the controls until the threshold level for jitter distortion audibility had been reached. Table 2.2 shows the recorded threshold ranges for the participants for each test track.

Test track                        Jitter frequency   Jitter audibility threshold
1: One note, single instrument    1.70 kHz           50 ns to 270 ns
2: One note, single instrument    1.85 kHz           32.5 ns to 110 ns
3: One note, single instrument    1.70 kHz           20 ns to 310 ns
4: Synthesized music recording    1.53 kHz           112 ns to 370 ns*

*Not all test participants were able to find an audibility threshold for track 4.

Table 2.2: Range of jitter audibility thresholds recorded for all test participants when playing program material with added sinusoidal jitter.

The audibility threshold for the jitter added to the higher frequency sine waves was slightly lower than for any of the more regular program material, and the results for the program material can be considered on par with what Manson [16] found for sinusoidal jitter added to the selected program material in his study. The same audio equipment was used for all the listening tests, and a set of headphones instead of speakers was selected to reproduce the audio recordings in the study. The jitter was added to the source material by running the signal through a jitter modulator to which a function generator was connected, through which the jitter level could be controlled.
Measurements on the audio system used in the listening experiments indicated that the jitter levels inherent in the system itself were well below any of the audibility thresholds recorded during testing and should, according to the authors, not have any influence on the test results.

A third study including jitter audibility testing was conducted by Ashihara et al. [15] in 2005. In it, random jitter was simulated in software by creating new sample values by interpolation, after which the interpolated values were shifted to the ideal sampling points in time. An anti-aliasing filter was also added to make sure the sampling theorem was still satisfied. The test subjects, all people with backgrounds in different audio fields, were asked to audition source material of their own selection using their own audio equipment. Only a computer with a digital audio interface was provided as signal source, and three controls, A, B and X, were given from which the playback of the source material could be controlled. Selecting X always played back the original source material without any added jitter. One of the controls A and B was randomly set to also select the original non-jittered source material, while the other control selected the source material with the added random jitter. The test subject was informed of this setup, asked to listen to the selected source material for a couple of minutes and then, at the end, to decide which one of the controls A and B played back the same version as control X. The test started with plenty of jitter being added to the source material and was run multiple times under the same conditions. The test subject was allowed to proceed to the next step, where the jitter level was halved, once 75 % or more of the attempts were correct. If too many incorrect answers were given, the test was aborted and the last successfully determined jitter level was recorded. Table 2.3 shows the results from the listening test. None of the test subjects were able to audibly distinguish the level following 500 ns, that is 250 ns of random jitter. That the recorded jitter audibility threshold is somewhat higher in this study than in the ones previously presented does not come as a big surprise, as only random jitter was used and, more importantly, the program material in the previous studies was tailored to maximize the audible effects of jitter, while more “normal” music was likely used here since the participants were allowed to pick their own listening material.

Random jitter   Audibility among test participants
2 µs            100 %
1 µs            48 %
500 ns          26 %
250 ns          0 %

Table 2.3: Proportion of test participants that were able to hear the effects of random jitter added to self-selected source material.

In all three studies mentioned so far, listening experiments were used to determine the threshold limits for jitter audibility. The lowest recorded threshold values from any of the studies were in the single digit nanosecond range, noted when sine waves of high frequency were used as program material. Using recordings of solitary real instruments as program material increases the threshold to tens of nanoseconds, and using even more varied and complex source material sends the threshold limit into the range of hundreds of nanoseconds.

2.1.5.1 A Theoretical Jitter Audibility Model

An idea commonly presented is that jitter levels below the quantization noise floor will be inaudible [14, 15, 18].
In Figure 2.14, quantization with low resolution has been used to show an example of an analog signal and its quantized digital counterpart. The horizontal grid lines indicate the available digital levels that the signal can assume and the vertical grid lines mark the sampling interval points. Any real audio application is likely to use a much higher resolution to produce a smoother digitized waveform that more closely resembles the analog signal, but even then there will still be a least significant bit (LSb) size in the digital representation of the audio signal that, together with other system parameters, sets the level of the noise floor. For a DAC with a resolution of N bits, the total number of values that can be represented is $2^N$ and the LSb is $1/2^N$ of the total representable range.

Figure 2.14: Analog sine wave and the resulting waveform after quantization in low resolution.

To find a relation between the system parameters and the quantization noise floor, we can start by considering a sine wave:

$y(t) = A \sin(2\pi f t)$.  (2.8)

The rate of change of the curve is

$\frac{dy(t)}{dt} = 2\pi f A \cos(2\pi f t)$.  (2.9)

At t = 0 the rate of change has its maximum value, as the slope of the sine wave is steepest there. Since $\lim_{t \to 0} \cos(2\pi f t) = 1$, we are left with

$\left(\frac{dy}{dt}\right)_{max} = 2\pi f A$.  (2.10)

If the full range of a converter is used to represent a sine wave of amplitude A, then the interval from 0 up to the highest representable digital level will be of magnitude 2A. Since the LSb is $1/2^N$ of the total representable range, the LSb corresponds to $2A/2^N = A/2^{N-1}$. For any signal, the rate of change multiplied by the amount of time during which the change occurs determines the new level of the signal. The LSb can therefore also be expressed as the rate of change dy/dt multiplied by the amount of time $t_j$ it takes to reach the new level when the rate of change is at its maximum:

$t_j \frac{dy}{dt} = \mathrm{LSb} = \frac{A}{2^{N-1}}$.  (2.11)

Substituting Equation 2.11 into Equation 2.10 gives us

$\frac{A}{t_j 2^{N-1}} = 2\pi f A$.  (2.12)

After rearranging Equation 2.12 we have an expression for $t_j$, which corresponds to the time of one LSb:

$t_j = \frac{1}{2\pi f 2^{N-1}}$.  (2.13)

For any timing jitter to fall below the quantization noise floor, it needs to correspond to less than half an LSb, so the result of Equation 2.13 also has to be divided by a factor of 2. For a 16-bit converter and a maximum signal frequency of 20 kHz, the jitter level should therefore be lower than $t_j/2$ = 121 ps for it to fall below the quantization noise floor and be inaudible.

2.2 AES/EBU and S/PDIF

The physical appearance and the electrical characteristics of the two interfaces differ, as the professionally oriented AES/EBU [5] interface uses balanced XLR connectors while the more consumer oriented S/PDIF [6] uses either a coaxial cable with RCA connectors or an optical fibre with TOSLINK connectors, but beneath the dissimilar exteriors they both use the same data transfer protocol. The audio data and the clock signal are transferred along the same data line, combined into one bit stream using biphase mark code (BMC). Data is transferred in blocks consisting of 192 frames, each frame is divided into multiple subframes, one for each audio channel, and each subframe contains 32 bits of data. Figure 2.15 shows the structure of the transfer scheme, including the placement of the data fields inside a subframe.
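Before the figure and the field listing in Table 2.4 below, a minimal C sketch of how a 32-bit subframe word could be assembled may help make the bit layout concrete. The bit positions follow Table 2.4; the helper function and its conventions (for instance that the parity slot makes bits 4–31 contain an even number of ones, and that the validity bit is 0 for valid audio) are illustrative rather than taken from any particular implementation.

#include <stdint.h>
#include <stdio.h>

/* Bit positions within a 32-bit S/PDIF / AES-EBU subframe word
 * (time slots 0-31, see Table 2.4). The 4 preamble slots are not
 * ordinary data bits and are left out of this sketch. */
enum {
    AUX_SHIFT   = 4,   /* slots 4-7      */
    AUDIO_SHIFT = 8,   /* slots 8-27     */
    V_BIT       = 28,  /* validity       */
    U_BIT       = 29,  /* user data      */
    C_BIT       = 30,  /* channel status */
    P_BIT       = 31   /* parity         */
};

/* Pack a 20-bit audio sample into a subframe word and append the status
 * bits. Parity is chosen so that slots 4-31 contain an even number of
 * ones, which is how the parity slot is commonly described. */
static uint32_t spdif_pack(uint32_t sample20, int valid, int user, int chstat)
{
    uint32_t w = 0;
    w |= (sample20 & 0xFFFFFu) << AUDIO_SHIFT;
    w |= (uint32_t)(!valid)      << V_BIT;   /* V is conventionally 0 for valid audio */
    w |= (uint32_t)(user & 1)    << U_BIT;
    w |= (uint32_t)(chstat & 1)  << C_BIT;

    /* even parity over slots 4-30, stored in slot 31 */
    uint32_t ones = 0;
    for (int b = 4; b <= 30; b++)
        ones += (w >> b) & 1u;
    w |= (ones & 1u) << P_BIT;
    return w;
}

int main(void)
{
    uint32_t w = spdif_pack(0x81234, 1, 0, 0);
    printf("subframe word: 0x%08X\n", w);
    printf("audio sample : 0x%05X\n", (w >> AUDIO_SHIFT) & 0xFFFFFu);
    return 0;
}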
Figure 2.15: Data structure of AES/EBU and S/PDIF. The contents of a) an audio block, b) a frame and c) a subframe.

Bit no.   Subframe field        Usage
0–3       Preamble              Indicates the start of a subframe. The field specifies whether it is a) the first subframe within a frame, b) the first subframe in a block, or c) any other subframe.
4–7       Auxiliary bits        Used to add auxiliary information, or can be assigned to carry an extra 4 bits of audio data, extending the audio sample word size from 20 to 24 bits.
8–27      Audio sample word     The bits used to carry the audio data, sent with the LSb first.
28        Validity bit          Indicates whether the audio sample word contains valid audio data or not.
29        User data bit         Can be used to carry any user defined data.
30        Channel status bit    Used to indicate type of interface, sample rate, copy permission and other settings. The meaning of the bit depends on the frame number within the block in which it is transmitted. All subframes within a frame carry the same channel status bit.
31        Parity bit            Used to detect errors in the transmitted data.

Table 2.4: Description of subframe data fields.

2.2.1 Biphase Mark Code

The clock signal and the audio data, together with all other parts of the subframe, are for the AES/EBU and S/PDIF interfaces sent along the same data line using biphase mark encoding [19], also known as Differential Manchester encoding. For each bit of data that is sent, at least one signal transition is guaranteed to happen. If the data bit is a “0”, the encoded signal sent to the receiver switches polarity once, at the start of the time slot, and if the data bit is a “1”, the signal changes polarity twice, once at the start and once in the middle of the time slot for that bit. The only part of the subframe that is allowed to violate this condition is the first part containing the preamble bits. It has only three valid bit sequences, used for indicating the placement of the subframes in the data stream. The reason for this is to ensure that no other data in the subframe can contain the same bit sequence as the preamble, which would otherwise throw off the synchronization of the data being transmitted. An example of BMC is shown in Figure 2.16.

Figure 2.16: Biphase mark code timing diagram.

2.2.2 Clock Recovery

After data has been transmitted, the receiver needs to recover the clock signal and separate it from the rest of the encoded data, and there are some problems that can arise in the recovery process. A study by Dunn and Hawksford [2], carried out relatively soon after AES/EBU and S/PDIF had come into widespread use, points out many of the known problems with the interfaces’ characteristics, the most significant one being that jitter can depend on the data pattern in the transmission. Due to bandwidth limitations, the rise and fall times of the transmitted signal are limited. A transition from high to low, or vice versa, can therefore begin from a voltage level that depends on the previous data pattern, as the signal has not had time to settle from the previous transition. Dunn and Hawksford [2] created a simulation model of a bandwidth limited transmission channel using a passive low pass filter, which despite its simplicity showed generally good agreement with measurements conducted on real systems.
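To make the mechanism concrete, the sketch below encodes a short bit pattern with biphase mark code, runs the ideal ±1 V waveform through a discretized first-order RC low-pass of the same kind as the channel model in Figure 2.17 and Equation 2.14 below, and prints where the filtered signal crosses 0 V relative to the ideal transition instants. The bit rate, time constant and oversampling factor are illustrative assumptions, not values taken from the referenced study.

#include <stdio.h>
#include <math.h>

#define OSR     64          /* samples per data-bit slot               */
#define NBITS   8
#define UI_NS   325.5       /* one bit slot at ~3.072 Mbit/s, in ns    */
#define TAU_NS  100.0       /* channel time constant RC = 100 ns       */

int main(void)
{
    const int data[NBITS] = {0, 1, 1, 0, 0, 0, 1, 0};
    const double dt    = UI_NS / OSR;            /* time step in ns     */
    const double alpha = dt / (TAU_NS + dt);     /* RC update coefficient */

    double x[NBITS * OSR];   /* ideal BMC waveform */
    double y[NBITS * OSR];   /* filtered waveform  */
    double level = -1.0;

    /* Biphase mark code: always toggle at the start of a bit slot,
     * and toggle again mid-slot if the data bit is a "1". */
    for (int b = 0; b < NBITS; b++) {
        level = -level;
        for (int s = 0; s < OSR; s++) {
            if (data[b] == 1 && s == OSR / 2)
                level = -level;
            x[b * OSR + s] = level;
        }
    }

    /* First-order RC low-pass, a discretized form of Equation 2.14. */
    y[0] = x[0];
    for (int n = 1; n < NBITS * OSR; n++)
        y[n] = y[n - 1] + alpha * (x[n] - y[n - 1]);

    /* Report 0 V crossings of the filtered signal and their offset from
     * the ideal transition instant (multiples of half a bit slot). */
    for (int n = 1; n < NBITS * OSR; n++) {
        if ((y[n - 1] < 0.0) != (y[n] < 0.0)) {
            double frac  = -y[n - 1] / (y[n] - y[n - 1]);
            double t     = (n - 1 + frac) * dt;
            double ideal = floor(t / (UI_NS / 2)) * (UI_NS / 2);
            printf("crossing at %7.1f ns, %5.1f ns after ideal edge\n",
                   t, t - ideal);
        }
    }
    return 0;
}

The varying delay printed for each crossing is exactly the data dependent jitter discussed above: transitions that start from a voltage closer to 0 V cross the threshold sooner than transitions that start from a fully settled level.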
Figure 2.17: First order passive low pass filter used to simulate a bandwidth limited transmission channel.

The voltage $V_{out}$ over the capacitor in Figure 2.17 is

$V_{out} = V_{in}\left(1 - e^{-\frac{t}{RC}}\right) + V_0 e^{-\frac{t}{RC}}$,  (2.14)

where $V_{in}$ is the amplitude of the non-filtered signal and $V_0$ is the voltage level from which the signal transition begins. Using the filter to simulate a bandwidth limited transmission line with time constant τ = RC of 100 ns between a S/PDIF transmitter and receiver gives the graph in Figure 2.18. Looking at just the first eight bits of the subframe in Figure 2.19, we can see more clearly that the starting voltage of each signal transition is different and that it depends on previous signal transitions. The consequence is that the time at which the 0 V threshold level is reached at a signal transition varies with the previous bit pattern, and as we have seen before, a shift in voltage level can create timing jitter. This is shown in Figure 2.20, where bits four and five of the subframe are displayed and the delay between the ideal signal edge and the bandwidth limited signal differs from transition to transition. The time taken to reach the 0 V threshold level is given by

$t = RC \ln\left(1 + \left|\frac{V_0}{V_{in}}\right|\right)$.  (2.15)

Solutions that help lessen this issue with data dependent jitter include using data patterns in the auxiliary bits and user bits that are less prone to creating jitter [2], and locking on to only the first bits of the preamble, creating a local clock from that bit sequence alone instead of using every transition in the whole subframe to generate it [20].

Figure 2.18: Transmission of one subframe over a bandwidth limited channel.

Figure 2.19: First eight bits of the subframe.

Figure 2.20: Bits four and five of the subframe.

2.2.3 Asymmetric Slew Rates

Another cause of jitter in AES/EBU and S/PDIF interfaces is slew rate imbalance, giving asymmetric rise and fall times for the signal. Dunn and Hawksford [2] presented the formula shown in Equation 2.16 for the amount of jitter $t_j$ per signal transition that is created by an asymmetry in the slew rates. $V_d$ is the driving voltage of the transmitter, $V_{SR+}$ is the slew rate in the positive direction and $V_{SR-}$ is the slew rate in the negative direction.

$t_j = \frac{|V_d|}{2}\left|\frac{1}{|V_{SR+}|} - \frac{1}{|V_{SR-}|}\right|$  (2.16)

A visualization of a signal response with symmetric versus asymmetric slew rates is shown in Figure 2.21. The slew rate limited response with symmetric rise and fall times crosses the 0 V threshold level at times τ1 and τ3 instead of at τ0 and τ2 like the ideal square wave does, but since the time difference τ3 − τ1 = τ2 − τ0, this does not present a problem. The falling edge of the slew rate limited response with asymmetric slew rates, on the other hand, crosses the threshold level at τ4 instead of at τ3, as the falling edge only has half the rate of change per time unit compared to the rising edge, and since τ4 − τ1 ≠ τ2 − τ0, the slew rate asymmetry introduces jitter into the signal. One suggested solution is to let the receiver rely on signal transitions in one direction only, effectively removing the need for the slew rates to be matched.
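As a numerical illustration of Equations 2.15 and 2.16, the small C sketch below plugs in made-up but plausible values for the channel time constant, the residual starting voltage of an edge, the drive voltage and the two slew rates; none of the numbers are taken from the referenced study.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Equation 2.15: time to reach the 0 V threshold after a transition. */
    double RC  = 100e-9;   /* channel time constant, s                    */
    double Vin = 1.0;      /* settled signal amplitude, V                 */
    double V0  = 0.4;      /* voltage the edge starts from, V (example)   */
    double t_cross = RC * log(1.0 + fabs(V0 / Vin));
    printf("0 V crossing delay: %.1f ns\n", t_cross * 1e9);

    /* Equation 2.16: jitter per transition due to slew-rate asymmetry. */
    double Vd    = 5.0;    /* driving voltage, V                          */
    double SRpos = 50e6;   /* rising slew rate, V/s (example)             */
    double SRneg = 40e6;   /* falling slew rate, V/s (example)            */
    double t_j = fabs(Vd) / 2.0 * fabs(1.0 / fabs(SRpos) - 1.0 / fabs(SRneg));
    printf("slew-rate jitter  : %.1f ns\n", t_j * 1e9);
    return 0;
}

With these example values the crossing delay comes out at roughly 34 ns and the slew-rate induced jitter at 12.5 ns per transition, both well above the lowest audibility thresholds quoted in Chapter 2.1.5.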
Figure 2.21: Symmetric versus asymmetric slew rate response to an ideal square wave.

2.2.4 Transmission Lines

A model for a transmission line used for high frequency signals [1, 19, 21] is shown in Figure 2.22. The parameters resistance R, inductance L, capacitance C and conductance G are given per unit length of transmission line. The same model can be applied to a S/PDIF interface using a coaxial cable to transfer audio.

Figure 2.22: Cascaded network model for a high frequency transmission line.

The transmission line has a characteristic impedance of

$Z_0 = \sqrt{\frac{R + sL}{G + sC}}$,  (2.17)

with s being the frequency operator, for a sine wave often denoted by jω. Let us now attach a load with impedance $Z_L$ to the transmission line and then apply a voltage pulse of size $V_i$ to the other end of the line, as depicted in Figure 2.23. What happens when the voltage reaches the load depends on the impedance $Z_L$ of the load. For a perfectly matched system, where the transmission line impedance $Z_0$ is equal to the load impedance $Z_L$, the whole voltage pulse $V_i$ continues into the load, but if the impedances differ, a part of the voltage is reflected back along the transmission line. Equation 2.18 gives the reflection coefficient ρ of the system.

$\rho = \frac{V_{reflected}}{V_{incident}} = \frac{Z_L - Z_0}{Z_L + Z_0}$  (2.18)

When the voltage pulse $V_i$ in Figure 2.23 is applied, it starts to move along the transmission line towards the load with the propagation velocity

$v = \frac{c}{\sqrt{\epsilon_r \mu_r}}$,  (2.19)

where c is the speed of light, $\epsilon_r$ is the relative permittivity and $\mu_r$ is the relative permeability of the transmission line. Equation 2.18 gives the ratio between the incident voltage and the reflected voltage. If, for example, the load impedance $Z_L$ and the transmission line impedance $Z_0$ are severely mismatched, with $Z_L$ being twice the size of $Z_0$, then the reflection coefficient ρ is 0.33 and the amplitude of the reflected voltage pulse is one third of the amplitude of the applied voltage $V_i$. For a more tightly matched system, where $Z_0$ and $Z_L$ differ by only 1 %, the amplitude of the reflected voltage pulse at the load end is still around 5 mV per 1 V of applied voltage $V_i$. Figure 2.24 shows the voltage on the transmission line before it has reached the load and after a part of it has been reflected.

Figure 2.23: Transmission line with load attached.

Figure 2.24: High frequency voltage pulse applied to a transmission line. a) Approaching the load. b) Reflected voltage at the load.

The transmitting side of the system, where the voltage pulse $V_i$ originates, also has an impedance of its own and the same reasoning applies to that end of the circuit, so if there is an impedance mismatch between the transmitting side and the transmission line, then the part of the voltage pulse first reflected at the load end will be reflected once again when it reaches the transmitting side. A pulse can therefore be reflected several times between the transmitter and receiver sides if both ends have an impedance different from the transmission line impedance $Z_0$. In reality, any such voltage pulse bouncing back and forth is likely to diminish quickly, as |ρ| < 1 in every case except a completely open or fully shorted circuit end.
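A short sketch of Equation 2.18 with purely resistive impedances reproduces the two example figures above; the 75 Ω line impedance is the nominal value for coaxial S/PDIF, while the load values are simply the two mismatch cases just discussed.

#include <stdio.h>
#include <math.h>

/* Reflection coefficient per Equation 2.18 for purely resistive impedances. */
static double reflection_coeff(double z_load, double z_line)
{
    return (z_load - z_line) / (z_load + z_line);
}

int main(void)
{
    double z0      = 75.0;                    /* nominal S/PDIF line impedance */
    double loads[] = { 150.0, 75.75, 75.0 };  /* 2x mismatch, +1 %, matched    */

    for (int i = 0; i < 3; i++) {
        double rho = reflection_coeff(loads[i], z0);
        printf("ZL = %6.2f ohm: rho = %+.4f, reflected %6.1f mV per 1 V\n",
               loads[i], rho, fabs(rho) * 1000.0);
    }
    return 0;
}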
We know from Chapter 2.1.2 that voltage noise can lead to timing jitter, so even small reflections due to impedance mismatch between a high frequency transmission line and its load, in our case the coaxial cable connecting the transmitter and the receiver and the transmitter and receiver units themselves, can cause issues. Proper impedance matching between the transmitter, the receiver and the cable connecting them is therefore necessary.

S/PDIF and AES/EBU signals transmitted over coaxial cables are also affected by other attributes of the transmission channel apart from the impedance. Dielectric losses and the skin effect, where a high frequency signal travels mainly along the surface of the conductor and only penetrates a short distance into its core, are examples of effects that could be expected to influence the voltage level and rise times of the signal. Both depend on the material parameters of the cable and on the frequency of the signal being transmitted, but as the frequency of the clock signal being sent is expected to stay the same, all signal transmissions should be affected to an equal extent, in which case no new variable jitter would be added to the signal. Optical channels also have their own share of issues that could be expected to affect a signal propagating through the fibre, such as pulse dispersion and limited bandwidth in the transmitter and receiver components, but we will not go any further into whether and how that might impact a digital audio signal being transmitted.

2.2.5 FIFO Buffers

One attempt at solving the interface jitter issues of AES/EBU and S/PDIF has been to insert a first in, first out (FIFO) buffer between the receiver chip and the DAC in the converter and then reclock the data coming out of the FIFO buffer. This approach has been used by some audio manufacturers, but it has some drawbacks. The introduction of a buffer in the audio chain will undoubtedly delay the audio signal. This might be acceptable to a certain degree if the audio is only used for music playback, but if the transmitted audio stems from a video stream, then the audio and video can become noticeably out of sync unless the buffer is small. A delay could also cause problems if the audio system were used for telephone-style communication, as it would make the conversation disjointed and less smooth.

The purpose of the added FIFO is to reclock the signal with a clock that has less jitter than the one arriving from the transmitter, so two clocks, the one supplied by the transmitter feeding the audio data into the FIFO and a second one supplied by the receiver moving data out of the FIFO, will be running freely, not synchronized to each other. The clocks will essentially run at the same rate, but any difference or variation at all in the clock rates will make them drift apart, and this must be remedied by a buffer large enough to accommodate the drift between the clocks so that the FIFO neither underruns nor overflows. If we have a system with a 44.1 kHz sample rate, then one new audio sample arrives every 22.68 µs for each audio channel.
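The next paragraph works through the worst-case arithmetic for this drift; the small C sketch below mirrors that calculation, assuming for illustration that both clocks are rated at ±100 ppm and that the buffer is half filled before playback starts.

#include <stdio.h>

/* Worst-case FIFO drift between two free-running clocks, assuming
 * (illustratively) +/-100 ppm frequency tolerance on both sides and
 * a half-filled buffer at the start of playback. */
int main(void)
{
    double fs    = 44100.0;  /* nominal sample rate, Hz              */
    double ppm   = 100.0;    /* tolerance of each oscillator         */
    double hours = 1.0;      /* required uninterrupted playback time */

    double worst_drift = fs * 2.0 * ppm * 1e-6;         /* samples per second  */
    double drift_total = worst_drift * hours * 3600.0;  /* samples accumulated */
    double buffer      = 2.0 * drift_total;             /* half-full at start  */
    double delay_s     = drift_total / fs;              /* initial latency     */

    printf("worst-case drift  : %.2f samples/s\n", worst_drift);
    printf("drift over %.0f hour: %.0f samples\n", hours, drift_total);
    printf("buffer per channel: %.0f samples\n", buffer);
    printf("startup delay     : %.2f s\n", delay_s);
    return 0;
}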
A normal crystal oscillator (XO), like for example the one used to generate the external clock for our DAC in Figure 3.12, can have a frequency stability rating of ±100 parts-per-million (ppm), which means that the clock could in the worst case be off by one sample in every 10 000 compared to an ideal clock. If we have two clocks with the same frequency stability rating running side by side, where both clocks deviate maximally from the ideal frequency but in opposite directions, then the sample rate could be off by up to 8.82 samples per second. In an hour that amounts to 31 752 samples, so if we fill the FIFO buffer half-way before we start extracting data from it, it would need to be able to fit 63 504 samples for each audio channel to guarantee uninterrupted playback for one whole hour. The delay caused by the FIFO buffer would in that case be 0.72 s at the start of playback. While not ideal, this could be acceptable for audio playback, but in other applications such as video streaming the delay between the video and the audio would simply be too large unless compensated for elsewhere.

Another option that has been tried together with a FIFO buffer is to use an asynchronous sample rate converter (ASRC). The average incoming data rate is first measured and the audio signal is then resampled by the ASRC to match the rate of the clock which extracts the data from the FIFO buffer and hands it over to the DAC. In this way the buffer does not need to be as large, since the ASRC adjusts the audio samples by interpolation so that the average data rate going into the FIFO is the same as the data rate coming out of it, and the buffer will therefore neither overflow nor underrun even though the clock rates at the input and output of the buffer might be slightly different. The use of an ASRC could however give other undesirable audible effects depending on how well it has been implemented, and including not only a FIFO but also an ASRC in the design adds to the complexity of the device.

2.3 Universal Serial Bus

The next sections in this chapter will mainly focus on the parts of USB that are of relevance for the thesis, such as the transfer protocol structure, the device descriptors, the isochronous transfer modes and other audio and timing related subjects. Of the multiple versions of the USB specification, Universal Serial Bus Specification revision 2.0 [22] is the version that the coming sections comply with, simply because it is the most appropriate version considering the hardware used in the construction part of the project. The specification declares several attributes that make USB suitable as a dedicated audio interface; among other things, guaranteed bandwidth and low latency for audio are listed as key points. The implementations in the hardware construction part of the thesis project are done using a “full-speed” device as defined by the USB 2.0 specification. Subsequent revisions of the specification [23, 24] supporting SuperSpeed devices do introduce some new concepts, but they are of no use to us here. Time units are, for example, handled differently. Full backward compatibility with USB 2.0 is however guaranteed for any full- or high-speed device connected to a host port using SuperSpeed. Whenever the USB specification is mentioned going forward, revision 2.0 is intended unless explicitly stated otherwise.
2.3.1 Network Topology

A USB system is controlled by a single host, which polls the bus to which devices are connected. Devices can be grouped into different classes, such as the audio device class. The communication channels between the host and a device are called pipes; they can carry messages or stream data, and they are connected to endpoints at the device and at the host. At a minimum, a device must implement at least one bidirectional message pipe called the default control pipe, which is connected to the control endpoints of the device. Capabilities added to a USB system, for example in the form of an audio interface, are called functions, so from our point of view the terms device and function can be used interchangeably.

The network topology of USB has a tree-like structure. At one end there is the controlling host, at which the root hub resides. Devices or other hubs can be connected to the root hub, and in extension more devices and hubs can be connected to a hub which is connected to the root hub, as displayed in Figure 2.25. In order not to violate the timing specifications, no more than six additional levels following the root hub layer can be connected together.

Figure 2.25: USB topology.

2.3.2 Connecting a Device to the Bus

When a device is connected to the bus, the host needs to discover and configure it before it can perform any function. The process of configuring and enabling the device is called enumeration, and it is done by performing a number of steps through which the device state is altered until configuration has completed. At first, the hub to which the device has been connected will set the device to the powered state and report to the host that its status has changed. This causes a query to be sent from the host to the hub to find out what caused the change. Once the host knows that a device has been attached, it sends a reset command and has the port to which the device is attached set to enabled. After the device has been reset, it goes into the default state, during which the host can communicate with it using the default address. The following steps in the configuration process assign a unique address to the device, causing it to go into the address state, and finally into the configured state once the host has read all the configuration information in the device’s descriptor table and has assigned a configuration value to it. In the configured state, all endpoints described in the device’s descriptor table have been enabled and the device is ready for use.

Figure 2.26: Device state changes during the enumeration process.

2.3.3 Descriptors

A device reports its capabilities and settings to the host upon request through its descriptors. The host can also change some of the device settings by altering the values in the device’s descriptor table through device requests. A device has exactly one main device descriptor that contains general information about the device, and it also lists one or more possible configurations of the device in the underlying configuration descriptors.
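For reference, the standard device descriptor is a fixed 18-byte structure whose field names are defined by the USB 2.0 specification; a C sketch of its layout might look as follows (the packing attribute shown is GCC/Clang specific).

#include <stdint.h>

/* Layout of the standard USB device descriptor (18 bytes); field names
 * follow the USB 2.0 specification. */
struct usb_device_descriptor {
    uint8_t  bLength;            /* size of this descriptor: 18              */
    uint8_t  bDescriptorType;    /* DEVICE descriptor type: 0x01             */
    uint16_t bcdUSB;             /* spec release in BCD, e.g. 0x0200         */
    uint8_t  bDeviceClass;       /* 0x00 means class is defined per interface */
    uint8_t  bDeviceSubClass;
    uint8_t  bDeviceProtocol;
    uint8_t  bMaxPacketSize0;    /* max packet size for endpoint 0           */
    uint16_t idVendor;
    uint16_t idProduct;
    uint16_t bcdDevice;          /* device release number                    */
    uint8_t  iManufacturer;      /* indices into the string descriptors      */
    uint8_t  iProduct;
    uint8_t  iSerialNumber;
    uint8_t  bNumConfigurations; /* number of configuration descriptors      */
} __attribute__((packed));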
A configuration descriptor will in turn list one or more interface descriptors, and each interface descriptor will then list one or more endpoint descriptors. When a configuration descriptor is requested by the host, it is returned by the device accompanied by any underlying interface and endpoint descriptors; interface and endpoint descriptors cannot be requested on their own. Alternative settings for the interfaces may be provided by having multiple configuration descriptors. The default control endpoint is not listed among the endpoint descriptors, as it must be implemented by all USB devices as a control pipe with predefined settings. Endpoint descriptors declare, among other things, the direction of the endpoint, whether it is a control, isochronous, bulk or interrupt endpoint, and what type of synchronization it uses. Endpoints are unidirectional, but two endpoints with the same endpoint number can be created with opposite data directions. A feedback endpoint associated with a single isochronous endpoint is expected to have the same endpoint number as the isochronous endpoint, and if multiple endpoints use the same feedback endpoint, then the endpoint number used for feedback should be the same as that of the isochronous endpoint with the lowest endpoint number associated with it.

A high-speed device can also have a device_qualifier descriptor and an other_speed_configuration descriptor. The device_qualifier descriptor is similar to the device descriptor, but instead of providing information about the device for the current speed setting, it shows device information for the alternative speed setting. Requests for the device_qualifier descriptor will therefore return the full-speed information for a device running in high-speed, and the other way around for a device running in full-speed. In the same way, the other_speed_configuration descriptor returns a configuration descriptor for the alternative speed setting that the device is not currently using. An optional string descriptor can be included to provide information in readable Unicode text format for all devices. Figure 2.27 shows the overall structure of the descriptor table, and example descriptors for the audio device implementations used in the construction build are provided in Appendix A.

Figure 2.27: Configuration, interface and endpoint descriptor structure.

2.3.4 Device Classes and Device Requests

The descriptors on the device are made accessible to the host through replies to device requests sent from the host to the default control pipe of the device. Device requests can be of standard type, which all devices support, or they can be class or vendor specific. Device requests are used either to fetch values from a device descriptor or to manipulate the values in them. The prefixes “GET” and “SET” are used in the request name to indicate whether a request is meant to retrieve or change the descriptor data being referenced. The only standard device requests that do not use the two mentioned prefixes are the CLEAR_FEATURE request, which is used to switch off on-off toggle values, and the SYNCH_FRAME request, which is used to synchronize the host and the device when the size of the transferred data varies within a frame.
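Every device request is delivered in an 8-byte SETUP packet whose field names come from the USB 2.0 specification; the sketch below shows its layout together with a few of the standard request codes.

#include <stdint.h>

/* The 8-byte SETUP packet that starts every device request; field names
 * follow the USB 2.0 specification. */
struct usb_setup_packet {
    uint8_t  bmRequestType;  /* direction, type (standard/class/vendor), recipient */
    uint8_t  bRequest;       /* request code, e.g. GET_DESCRIPTOR                  */
    uint16_t wValue;         /* request specific, e.g. descriptor type and index   */
    uint16_t wIndex;         /* request specific, e.g. interface or endpoint       */
    uint16_t wLength;        /* number of bytes in the following data stage        */
} __attribute__((packed));

/* A few of the standard request codes defined by the specification. */
enum {
    USB_REQ_GET_STATUS        = 0x00,
    USB_REQ_CLEAR_FEATURE     = 0x01,
    USB_REQ_SET_ADDRESS       = 0x05,
    USB_REQ_GET_DESCRIPTOR    = 0x06,
    USB_REQ_SET_CONFIGURATION = 0x09,
    USB_REQ_SYNCH_FRAME       = 0x0C
};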
A device request transaction follows the pattern of a control transfer, with an initiating SETUP packet, an optional data packet depending on the request type, and a closing handshake, as depicted in Figure 2.35 in Chapter 2.3.10.2. If a device receives an invalid or unsupported request, it should respond appropriately and signal the error by setting the packet identifier to STALL, either in the following data packet or in the next status transaction. The USB audio device class is an extension of the USB standard, so a USB audio class device will have a number of extra descriptors containing information about the device’s audio capabilities, and on top of the standard device requests it will also support requests from the USB audio class specification. Software on the host communicating with an audio class device can use the standard audio class driver provided by the operating system, but it is also possible to load an external driver specific to the device and use that instead of the generic driver.

2.3.5 Transfer Types

Most transactions on the bus consist of three interactions:
1) The host sends a “token packet” with parameters to set up a transaction with one of the connected devices.
2) An attempt to transfer the requested data is made, either in the direction from the device to the host or from the host to the device.
3) The receiver of the data sends a “handshake packet” to indicate whether the data was transferred successfully or not.

There are four types of data transfers that can take place:

Data transfer type     Usage
Control transfer       Used for device configuration, commands and status requests.
Bulk transfer          Non-periodic transmission of non-time sensitive data, usually sent in larger chunks.
Interrupt transfer     Transmission of smaller amounts of time sensitive data that must be delivered reliably.
Isochronous transfer   Periodic transmission of real-time data with minimal delay.

Table 2.5: Data transfer types for USB.

We will not make any use of the bulk transfer mode or the interrupt transfer mode in any of the USB audio class devices described in this thesis, and therefore no time will be spent expanding the discussion around those subjects. Isochronous data transfer is the transfer mode used by USB audio devices to move audio data, so it is of most interest to us. Control transfers are also used to some degree by USB audio devices for supportive functionality.

2.3.5.1 Control Transfers

Control transfers allow the host software to configure and control device functions using the default control pipe of the device. Additional message pipes for control transfers used for other device specific purposes can be defined but are not obligatory. Requests to alter the device settings can be standard, device class specific, or vendor specific. Error free message delivery is guaranteed for this transfer type and bus access is granted in a best effort manner. Time is reserved for control transfers on the bus, but that time reservation is shared between all the connected devices and is not limited to a single device.

2.3.5.2 Isochronous Transfers

The characteristics of the isochronous transfer mode make it the most suitable of the USB transfer types for transmission of data like audio, which is consumed in real time. USB guarantees periodic access to the bus for isochronous data transfers, with an upper bound on the maximum allowed latency.
The latency of the transmitted data will depend on the amount of buffering that is done at each stage in the transmission chain. No retransmissions are made for any data lost to transmission errors, but the receiver can still discover that a transmission error has occurred by keeping track of the start-of-frame (SOF) count, the expected delivery interval and the cyclic redundancy check (CRC) field of the packets; for a high-speed high bandwidth device, the packet ID sequencing can also be used. The number of transmission errors is however expected to be low enough not to cause any problems. As a side note, a recommended bit error rate of less than or equal to $10^{-12}$ for a high-speed receiver is mentioned as a design guideline in the electrical characteristics section of the USB specification.

2.3.6 Time Units

For full-speed devices, USB divides time into units of 1 ms called frames. High-speed devices are able to use a narrower time span of 125 µs called a microframe. Each new frame is defined by the host sending out a SOF packet every 1 ms ± 0.5 µs, which devices can use for synchronization. The corresponding generation rate of SOFs for microframes is set to 125 µs ± 0.0625 µs by the USB specification.

2.3.7 Bus Access Period

A device using isochronous transfers must, at the time of being connected to the bus, inform the host software of its desired bus access period so that bandwidth can be allocated to accommodate the required data rate. This is done by setting an appropriate value in the bInterval field of the device’s standard endpoint descriptor. Valid values for isochronous endpoints are between 1–16, and the formula

$I = 2^{bInterval - 1} F$  (2.20)

expresses the desired polling interval. F is the duration of one frame or microframe depending on the speed of the connected device, so for a high-speed device F is 125 µs and for a full-speed device it is 1 ms. For a high-speed high bandwidth device, up to three transactions can take place during one microframe, but the host may not always be able to fulfil the desired access interval of the device. Table 2.6 shows the packet ID sequencing depending on the number of packets sent to a high-speed high bandwidth device during a microframe. By keeping track of the bit sequence in the packet ID field of the received packets, the device can detect if a packet is missing, and if that is the case, then all data sent during that same microframe should be treated as incomplete.

Transactions per microframe   1st packet   2nd packet   3rd packet
1 transaction                 DATA0        –            –
2 transactions                MDATA        DATA1        –
3 transactions                MDATA        MDATA        DATA2

Table 2.6: Packet ID sequencing for a high-speed high bandwidth device receiving isochronous data from the host.

2.3.8 Endpoint Buffering

Before creating, configuring and allocating bandwidth to an isochronous stream pipe for a device, the USB host software will calculate the amount of time that the isochronous transactions are going to take, to make sure that the needs of all devices sharing the bus can be accommodated. When data is sent through an isochronous stream pipe, it is first accumulated in a memory buffer and then transmitted in larger chunks in the form of packets. There must also be a buffer at the endpoint receiving the packets that can hold them until the device is ready to process them.
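As a small numerical sketch of Equation 2.20 together with the buffer sizing rule of Equation 2.21 given just below, the C snippet here uses illustrative values for the sample rate, sample size and bInterval; none of them are tied to the implementations described later in the thesis.

#include <stdio.h>
#include <math.h>

/* Illustrative values only: a full-speed device (1 ms frames), 48 kHz
 * sample rate, 4-byte sample frames (2 channels x 16 bits), bInterval = 1. */
int main(void)
{
    double F         = 1e-3;    /* frame duration for a full-speed device */
    int    bInterval = 1;       /* from the endpoint descriptor, 1..16    */
    double Fs        = 48000.0; /* audio sample rate                      */
    int    S         = 4;       /* bytes per sample frame                 */

    /* Equation 2.20: polling interval I = 2^(bInterval-1) * F */
    double I = (double)(1 << (bInterval - 1)) * F;

    /* Equation 2.21: room for twice the data produced during one polling
     * interval; ceil(Fs * I) here equals ceil((Fs / F_SOF) * I) with the
     * interval counted in frames. */
    double samples = ceil(Fs * I);
    double Bsize   = 2.0 * S * samples;

    printf("polling interval    : %.3f ms\n", I * 1e3);
    printf("endpoint buffer size: %.0f bytes (%.0f samples)\n",
           Bsize, 2.0 * samples);
    return 0;
}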
As a rule of thumb, the recommendation is that the buffers at both endpoints should be large enough to fit twice the amount of data that can be sent during one frame for a full-speed device, or one microframe for a high-speed device. The larger the buffers are, the bigger the latency in the audio chain becomes. An appropriate buffer size $B_{size}$ can be obtained from the formula

$B_{size} = 2S \left\lceil \frac{F_s}{F_{SOF}} I \right\rceil$,  (2.21)

where $F_s$ is the sample rate of the system, $F_{SOF}$ is the frequency of the USB frame clock, I is the polling interval from Equation 2.20 and S is the sample size of the device.

2.3.9 Prebuffering Delay

The way in which USB processes isochronous data through the buffers at each of the endpoints when it is transferred from source to sink inherently adds a delay. At the source, data is accumulated and buffered during frame X, for a full-speed endpoint, until the SOF for frame X+1 is transmitted. The data buffered during frame X is then sent during frame X+1 to the buffer at the sink endpoint. Only when the SOF for frame X+2 appears can the sink start processing the data that was accumulated during frame X at the source. Figure 2.28 displays the buffering delay. The same applies for a high-speed endpoint, but the time unit used is microframes instead of frames.

Figure 2.28: Delay induced due to prebuffering at endpoints.

2.3.10 Transfer of Data

Data is sent on the bus using non return to zero inverted (NRZI) encoding with the LSb first. The polarity of the NRZI encoded signal changes for every data bit that is a “zero” and remains the same for every data bit that is a “one”. Like the biphase mark encoding used by S/PDIF and AES/EBU, NRZI has the benefit of only having a small DC component. A separate signal line for a clock is likewise not needed, as the receiver can recreate the sample clock by itself. Since a long series of data containing nothing but ones produces an NRZI encoded signal with no transitions from high to low or vice versa until the next “zero” in the data appears, extra bits are inserted into the NRZI encoded data to guarantee that a signal transition happens at least every 7th bit. This is enough to ensure that the receiver can lock on to the signal. Any i