USB Specifications

Audiophile USB

This article will concentrate on the parts of the USB specifications that involve audio devices and music streaming. Some additional information will be included to provide a more complete overview. This paper is summarized from the official USB 2.0 specification document, which is well over 600 pages long. The specification documents can be obtained from www.usb.org, the home of the USB Implementers Forum (USB IF).

The Universal Serial Bus (USB) has become the de facto connection between computer devices. Chances are that if a connection isn't video related, such as HDMI, or networking related, such as Ethernet, it will be USB. This is an actively developed protocol that is extremely versatile. It allows for fast speeds and reasonable distances, and it has provisions to deliver power to peripheral devices. The USB protocol provides for streaming music data to and from numerous devices such as DACs, speakers and microphones. USB devices can be connected through hubs in a tiered-star arrangement, and all communication on the bus is scheduled by the host controller. USB 2.0 has become ubiquitous in computer audio and music playback. That's inherently a good thing, but as we'll see, it does require some consideration when USB is applied to streaming audio data.

Like any computer communications protocol, USB has a committee of scientists and engineers that maintains the specifications. For USB that is the USB Implementers Forum (USB IF). USB was developed in the mid-1990s, and has gone through a number of revisions and versions. USB 1.x defines Low Speed (1.5 Mbit/s) and Full Speed (12 Mbit/s) data rates. Computer audio is based on USB 2.0, which added the High Speed rate. USB 2.0 was released in April 2000, and it has a maximum signaling rate of 480 megabits per second (High Speed or High Bandwidth). That is up to 480 million ones and zeros being transmitted every second. USB 3.0 was released in 2008 and transfers data at 5 Gbit/s, referred to as Super Speed.
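To put these signaling rates in perspective, the raw bit rate of a PCM audio stream can be compared with each bus speed. The sketch below is only arithmetic. The two sample formats (CD-quality and 24-bit/192 kHz stereo) are illustrative assumptions, and protocol overhead is ignored.

```python
# Raw PCM bit rate compared with the USB signaling rates quoted above.
# Protocol overhead is ignored, so real usable bandwidth is lower.
USB_RATES_BPS = {
    "Low Speed": 1_500_000,
    "Full Speed": 12_000_000,
    "High Speed": 480_000_000,
}


def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


def bus_share(bit_rate: int, bus_name: str) -> float:
    """Fraction of the raw signaling rate that the audio stream would occupy."""
    return bit_rate / USB_RATES_BPS[bus_name]


cd_quality = pcm_bit_rate(44_100, 16, 2)   # 1,411,200 bit/s
hi_res = pcm_bit_rate(192_000, 24, 2)      # 9,216,000 bit/s

if __name__ == "__main__":
    for label, rate in (("CD quality", cd_quality), ("24/192 stereo", hi_res)):
        for bus in USB_RATES_BPS:
            print(f"{label} on {bus}: {bus_share(rate, bus):.2%} of raw rate")
```

Even the assumed high-resolution stream uses only a small fraction of the High Speed signaling rate, which suggests that the challenge of USB audio lies in timing rather than raw capacity.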

When you connect a USB endpoint (your DAC) to a USB host (your computer music server), the host establishes a communication link with the attached device. At this point, a unique ID is assigned and transfer rates are established. The host controller manages all of the traffic to the endpoint. If the endpoint device is disconnected, the host must reset and start the process over again. This usually takes only a few seconds. The connectors are normally different on each end of the cable. For computer audio, the cable is typically a standard detachable high-/full-speed cable that is terminated on one end with a Series "A" plug and on the opposite end with a Series "B" plug.

The USB pin-outs for type 'A' and type 'B' connectors.

There are four wires in a typical USB 2.0 cable. There are two data lines, one for data plus and one for data minus, which are combined for a two-wire differential signal. There are two leads for power, one for 5 volts called the Vbus and a ground wire. The D± signals used by Low, Full, and High Speed are carried over a twisted pair to reduce noise and crosstalk. They are in a half-duplex configuration. Differences in propagation delay between the two signal conductors must be minimized. The maximum allowable cable length is determined by signal pair attenuation and propagation delay. The cable impedance must match the impedance of the high-speed and full-speed drivers.

The shield of a shielded USB cable must be grounded at one or both of the end connectors. A high-speed cable should have twisted-pair data lines and an impedance of 90 ohms ±15%. The current draw from the Vbus is limited to 100 mA for low-power and un-configured devices. The maximum current draw for high-power devices is 500 mA.
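The electrical limits above translate into simple numbers. The following sketch works them out, assuming a nominal 5.0 V on the Vbus (real voltages vary within a tolerance that is not covered here). The function names are ours, chosen for illustration.

```python
# Power budget and impedance window implied by the figures in the text.
VBUS_VOLTS = 5.0  # nominal value; an assumption for this arithmetic


def power_watts(current_ma: float) -> float:
    """Power available from Vbus at the given current draw."""
    return VBUS_VOLTS * current_ma / 1000


def impedance_window(nominal_ohms: float = 90.0, tolerance_pct: float = 15.0):
    """Lowest and highest acceptable differential impedance."""
    delta = nominal_ohms * tolerance_pct / 100
    return nominal_ohms - delta, nominal_ohms + delta


low_power_w = power_watts(100)    # 0.5 W
high_power_w = power_watts(500)   # 2.5 W
z_min, z_max = impedance_window() # 76.5 to 103.5 ohms

if __name__ == "__main__":
    print(f"Low-power budget:  {low_power_w} W")
    print(f"High-power budget: {high_power_w} W")
    print(f"Impedance window:  {z_min} to {z_max} ohms")
```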

The two data wires should be twisted, especially in USB 2.0 and 3.0 configurations.

There are four different USB data transfer models: Control, Interrupt, Isochronous and Bulk. We will concentrate on isochronous transfer in this article, because it is the transfer flow used by computer audio data packets. Keyboards and mice use the Interrupt model, and raw data such as files use the Bulk transport model. You'll see that audio over USB is not your normal USB data flow.

USB defines four transfer types:

  1. Control Transfers: Request and response communication initiated by the host software, typically used for protocol command and status information.
  2. Isochronous Transfers: Continuous communication between host and device, typically used for time relevant information such as audio. This type preserves time elements encapsulated in the data, but does not allow for resending data if errors occur.
  3. Interrupt Transfers: Low-frequency, bounded-latency communication, such as for keyboards or pointing devices.
  4. Bulk Transfers: Large packet communication typically used for data such as files. These transfers can use a large amount of bandwidth, but can be delayed until bandwidth is available. Data transfer is assured if errors occur by requests to resend.
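The four transfer types can be summarized as a small table in code. This is only a sketch: the `TransferType` class and its two flags are our own simplification, and the retry behavior shown for Control and Interrupt transfers comes from our reading of the specification rather than from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferType:
    name: str
    periodic: bool          # serviced at a regular, scheduled interval
    retries_on_error: bool  # damaged data is sent again


TRANSFER_TYPES = [
    TransferType("Control", periodic=False, retries_on_error=True),
    TransferType("Isochronous", periodic=True, retries_on_error=False),
    TransferType("Interrupt", periodic=True, retries_on_error=True),
    TransferType("Bulk", periodic=False, retries_on_error=True),
]


def timely_but_unprotected(types):
    """Names of the types that keep a schedule but never resend data."""
    return [t.name for t in types if t.periodic and not t.retries_on_error]


if __name__ == "__main__":
    print(timely_but_unprotected(TRANSFER_TYPES))
```

Only the isochronous type combines a fixed schedule with no resending, which is why the rest of this article concentrates on it.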

Isochronous Transfers

Isochronous transfers generally imply constant-rate, error-tolerant transfers. If there is a delivery failure due to an error, there is no attempt to deliver the data again. At high speed, an isochronous packet can carry up to 1024 bytes, and a high-bandwidth endpoint may be given 2 or 3 packets per microframe. An example of a large error is no packet being delivered within a frame, due either to an error on the cable (the bus) or to a scheduling delay in the host computer's operating system. In this case the receiving endpoint doesn't get the Start Of Frame (SOF) packet. While it notices this error, it can't ask for the data again, because that would cause an even larger delay and further loss of data, so it just carries on processing the next packet.

Normally, handshakes would be returned to tell the transmitter (music server) whether a packet was successfully received or not. For isochronous transfers, timeliness is more important than correctness. Considering the relatively low error rates expected on the bus, the protocol is optimized by assuming transfers normally succeed. The endpoint (DAC) can recognize that an error has occurred, but it doesn't halt; it simply continues processing the next packet of data. It's up to the firmware in the DAC to use intelligent logic to mitigate errors as best it can.

Isochronous Packet transfer in Frames (Microframes).

Since no retries are ever done for high bandwidth isochronous endpoints, the device must use Packet ID (PID) sequencing to detect one or more lost or damaged packets within a frame. Delivering isochronous data reliably over USB requires careful attention to detail. This responsibility is shared by each of these USB entities: the device (DAC) firmware, the Bus (USB cable), and the host computer controller.
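As a rough illustration of PID sequencing, the sketch below checks the packets a DAC receives in one microframe against the sequences we understand the USB 2.0 specification to define for high-bandwidth isochronous OUT endpoints (DATA0 for one packet, MDATA then DATA1 for two, MDATA, MDATA then DATA2 for three). Treat the table as an assumption to verify against the specification; the function names are hypothetical.

```python
# Expected PID sequences per microframe for a high-bandwidth
# isochronous OUT endpoint (assumed from the USB 2.0 specification).
EXPECTED_OUT_PIDS = {
    1: ["DATA0"],
    2: ["MDATA", "DATA1"],
    3: ["MDATA", "MDATA", "DATA2"],
}

# The final PID of a microframe reveals how many packets were sent.
FINAL_PID_TOTAL = {"DATA0": 1, "DATA1": 2, "DATA2": 3}


def microframe_is_complete(received):
    """True if the received PIDs match one of the expected sequences."""
    return received in EXPECTED_OUT_PIDS.values()


def packets_missing(received):
    """Number of lost packets, or None when the total cannot be known.

    The total is unknown when the final packet itself was lost,
    because only the final PID encodes the packet count.
    """
    if not received or received[-1] not in FINAL_PID_TOTAL:
        return None
    return FINAL_PID_TOTAL[received[-1]] - len(received)


if __name__ == "__main__":
    print(microframe_is_complete(["MDATA", "MDATA", "DATA2"]))
    print(packets_missing(["MDATA", "DATA2"]))
```

A real device would also have to decide what to do with the surviving packets of a damaged microframe; that policy is left out here.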

Because time is a key part of an isochronous transfer, it is important to understand how packet timing is maintained. There are multiple clocks involved in the transmission of USB data.

  • The Sample Clock: This clock determines the natural data rate of audio samples. This is the original sample rate of the music, and is the same on the computer host software and on the device (DAC) firmware function.
  • The Bus Clock: Packet data on the USB cable runs at a 1.0 millisecond period (1 kHz frequency) on full-speed, and at a 125.0 microsecond period (8 kHz frequency) on high-speed. The clock period of the Bus is indicated by the rate of the Start Of Frame (SOF) packets.
  • The Service Clock: This rate is determined by the computer or music server hardware clock running the host operating system. It is buffering data to be channeled into the packets for transmission down the USB cable.
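The relationship between the sample clock and the bus clock can be made concrete. A 44.1 kHz stream does not divide evenly into 8000 microframes per second, so the number of samples per packet has to vary. The sketch below shows one simple way to spread the samples, using an integer accumulator; it is an illustration, not the scheduling used by any particular driver or DAC.

```python
def samples_per_packet(sample_rate_hz: int, packets_per_second: int, n_packets: int):
    """Sample count for each of the next n_packets, averaging to the exact rate.

    An integer accumulator carries the fractional remainder from one
    packet to the next, so no samples are gained or lost over time.
    """
    counts = []
    acc = 0
    for _ in range(n_packets):
        acc += sample_rate_hz
        n = acc // packets_per_second
        acc -= n * packets_per_second
        counts.append(n)
    return counts


if __name__ == "__main__":
    # High speed: 8000 microframes per second.
    print(samples_per_packet(44_100, 8000, 16))
    # Full speed: 1000 frames per second.
    print(samples_per_packet(44_100, 1000, 12))
```

At 48 kHz on high speed the result is a constant 6 samples per microframe, while 44.1 kHz alternates between 5 and 6 (or between 44 and 45 per frame at full speed).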

When you consider that these three clocks are running at different rates, and that there is no provision for error correction once the packets hit the Bus, audio connectivity between server and DAC gets complex. With audio data streams, some form of sample rate conversion is required. This is a form of rate adaptation. Instead of error control, sample interpolation is used in software to match incoming and outgoing sample rates. Depending on the interpolation techniques used, the audio quality (distortion, signal to noise ratio) of the conversion can vary significantly. In general, higher quality requires more processing power and sophisticated USB firmware.

The clocks provide a strict periodicity for the flow of data. When an error does occur such as a bad packet, it's important for the receiver to recognize that an error occurred so that it doesn't disturb the sequencing of further packets. This regular periodic data delivery provides a framework that is fundamental to detecting missing data. Any errors that occur within the packet payload contribute to digital noise in the signal, and are hopefully kept to a minimum. While packets do contain a CRC (Cyclic Redundancy Check), there isn't much that the receiver can do to correct the error, except account for it and do the best it can in the USB converter firmware to maintain the timing of the music.
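For readers curious about the CRC itself, USB data packets carry a 16-bit CRC. The sketch below implements the commonly catalogued CRC-16/USB parameters (polynomial 0x8005 processed least-significant bit first, initial value 0xFFFF, final inversion). The helper `payload_is_intact` is a hypothetical name, and how a real DAC reacts to a mismatch is up to its firmware.

```python
def crc16_usb(data: bytes) -> int:
    """CRC-16/USB: reflected polynomial 0xA001, init 0xFFFF, inverted output."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc ^ 0xFFFF


def payload_is_intact(data: bytes, received_crc: int) -> bool:
    """Detect (but not correct) corruption in a received payload."""
    return crc16_usb(data) == received_crc


if __name__ == "__main__":
    payload = bytes(range(16))
    crc = crc16_usb(payload)
    damaged = bytes([payload[0] ^ 0x01]) + payload[1:]  # flip a single bit
    print(payload_is_intact(payload, crc), payload_is_intact(damaged, crc))
```

The check can only report that something went wrong; with no retry available, recovery is limited to accounting for the damaged packet.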

For isochronous data to be communicated reliably, the three clocks identified above must be synchronized. If the clocks are not accurately synchronized, several issues arise that are bound to be undesirable.

  • Clock Drift: Two clocks that are essentially running at the same rate can in fact have minor differences that result in one clock running faster or slower than the other. This variation in time can lead to having too much or too little data when it is expected to always be present at the time required.
  • Clock Jitter: A clock may vary its frequency over time due to changes in temperature, or minute inconsistencies in the quartz oscillator. This jitter may alter when data is actually delivered, compared to when it is expected to be delivered.
  • Clock Phase Differences: If two clocks are not in phase, or beating at exactly the same instant, different amounts of data may be available at different points in time as the beat frequency of the clocks gets out of sequence over time. This can lead to quantization and sampling related artifacts.
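Clock drift is easy to quantify. The sketch below estimates how quickly a small frequency offset between two clocks consumes a buffer. The 500 ppm offset corresponds to the ±0.05% host clock accuracy figure quoted later in this article; the 480-sample buffer margin is purely an assumed example.

```python
def drift_samples_per_second(sample_rate_hz: float, ppm_offset: float) -> float:
    """Samples gained or lost each second for a given clock offset in ppm."""
    return sample_rate_hz * ppm_offset / 1_000_000


def seconds_until_buffer_slip(margin_samples: float, sample_rate_hz: float,
                              ppm_offset: float) -> float:
    """Time until a buffer with the given margin overruns or underruns."""
    rate = abs(drift_samples_per_second(sample_rate_hz, ppm_offset))
    if rate == 0:
        return float("inf")  # perfectly matched clocks never slip
    return margin_samples / rate


if __name__ == "__main__":
    # 48 kHz audio, clocks differing by 500 ppm (0.05 %).
    print(drift_samples_per_second(48_000, 500))
    # Assumed buffer margin of 480 samples (10 ms at 48 kHz).
    print(seconds_until_buffer_slip(480, 48_000, 500))
```

Without some form of rate adaptation or feedback, even this small mismatch would exhaust the assumed buffer in well under a minute.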

To summarize, whether the delivery is successful or not, the data flow between host computer and DAC endpoint must remain in synchronization as designated by the transaction period (bus packet timing). Once a receiver has determined that a data packet was not received, it should know the size of the data that was missed in order to recover from the error. If the communication flow is always the same data size per microframe, then the size is a known constant. In this way the audio signal will remain in time as determined by the frame rate, and therefore the PCM sample rate of the music will remain in time. Noise and jitter may still be present in the data signal, but are kept to a reasonably low level.
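The recovery idea in the summary above can be sketched in a few lines. When every packet carries a known number of samples, a missing packet can be replaced with the same number of placeholder samples so that everything after it stays in time. Substituting silence, as done here, is just one simple assumed strategy; actual DAC firmware may interpolate or repeat data instead.

```python
def conceal_missing(packets, samples_per_packet: int):
    """Flatten packets into a sample stream, filling gaps with silence.

    `packets` is a list in which a lost packet is represented by None.
    Each gap is replaced by `samples_per_packet` zero samples so the
    total length, and therefore the timing, is preserved.
    """
    stream = []
    for packet in packets:
        if packet is None:
            stream.extend([0] * samples_per_packet)
        else:
            stream.extend(packet)
    return stream


if __name__ == "__main__":
    received = [[10, 11, 12], None, [16, 17, 18]]
    print(conceal_missing(received, 3))
```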

This is how USB data is encoded - Non Return to Zero Inverted (NRZI).

The following points are provided to impress upon the reader the details that are inherent in the engineering of USB communications. USB employs NRZI data encoding when transmitting packets. NRZI is a method of encoding data in which a one is represented by no change in the signal level and a zero is represented by a change in level. There is no return to zero, or reference voltage, between encoded bits.
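A small sketch makes the encoding concrete. In USB's form of NRZI, a one leaves the line level unchanged and a zero toggles it. The levels are modeled here as plain 0 and 1, and the starting level of 1 is an assumption made for the example rather than a statement about the electrical idle state.

```python
def nrzi_encode(bits, initial_level: int = 1):
    """Encode data bits as line levels: 0 toggles the level, 1 holds it."""
    level = initial_level
    levels = []
    for bit in bits:
        if bit == 0:
            level ^= 1
        levels.append(level)
    return levels


def nrzi_decode(levels, initial_level: int = 1):
    """Recover data bits: a level change means 0, no change means 1."""
    previous = initial_level
    bits = []
    for level in levels:
        bits.append(1 if level == previous else 0)
        previous = level
    return bits


if __name__ == "__main__":
    data = [0, 1, 1, 0, 0, 1]
    line = nrzi_encode(data)
    print(line)
    print(nrzi_decode(line) == data)
```

Note that a long run of ones produces no transitions at all, which is the problem that bit stuffing addresses.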

To ensure that long series of the same bit are properly read, bit stuffing is used in the data packets. A zero is inserted after every six consecutive ones in the data stream before the data is NRZI encoded, to force a transition in the NRZI data stream. This gives the receiver logic a data transition at least once every seven bit times to guarantee the data and clock synchronization lock. The receiver must decode the NRZI data, recognize the stuffed bits, and discard them, and then process the data values.
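Bit stuffing and its removal can be sketched as follows. The functions operate on lists of 0 and 1 before NRZI encoding (stuffing) and after NRZI decoding (unstuffing). Raising an exception on a violation is our choice for the example; real receivers flag the packet as bad.

```python
def bit_stuff(bits):
    """Insert a 0 after every run of six consecutive 1s."""
    out = []
    run = 0
    for bit in bits:
        out.append(bit)
        if bit == 1:
            run += 1
            if run == 6:
                out.append(0)  # stuffed bit forces a transition
                run = 0
        else:
            run = 0
    return out


def bit_unstuff(bits):
    """Remove stuffed 0s; a 1 where a stuffed 0 is expected is an error."""
    out = []
    run = 0
    expect_stuffed = False
    for bit in bits:
        if expect_stuffed:
            if bit != 0:
                raise ValueError("bit stuffing violation")
            expect_stuffed = False
            run = 0
            continue  # discard the stuffed bit
        out.append(bit)
        if bit == 1:
            run += 1
            expect_stuffed = run == 6
        else:
            run = 0
    return out


if __name__ == "__main__":
    data = [1] * 8 + [0, 1]
    print(bit_stuff(data))
    print(bit_unstuff(bit_stuff(data)) == data)
```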

A high-speed cable should have a maximum one-way time delay of 26 nanoseconds. The maximum skew introduced by the cable between the differential signaling pair (D+ and D-) must be less than 100 picoseconds. Cable attenuation must also remain within the limits specified for each frequency range. Since the Host Controller must meet a clock accuracy specification of ±0.05%, it will automatically meet the frame interval requirements (time between successive frames). When clock accuracy is not met, jitter occurs. Jitter can also be caused by timing variations due to buffer delay, rise and fall time mismatches of digital pulses, internal clock source jitter, and noise and other random effects. Jitter will also be caused by mismatches in the source and destination data rates (frequencies). A high-speed receiver firmware function must reliably recover data with a peak-to-peak jitter of up to 30%.
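These limits are easier to appreciate when expressed relative to a single bit time. The sketch below is simple arithmetic on the figures quoted above, using the 480 Mbit/s signaling rate and the 125 microsecond microframe period; the variable names are ours.

```python
# Timing figures from the text, expressed in seconds.
HIGH_SPEED_BPS = 480_000_000
MICROFRAME_S = 125e-6
CLOCK_TOLERANCE = 0.0005      # the ±0.05 % host clock accuracy
MAX_CABLE_DELAY_S = 26e-9
MAX_PAIR_SKEW_S = 100e-12

# Duration of one bit on the wire, roughly 2.08 ns.
bit_time_s = 1 / HIGH_SPEED_BPS

# Allowed deviation of the microframe interval, 62.5 ns.
microframe_tolerance_s = MICROFRAME_S * CLOCK_TOLERANCE

# Pair skew as a fraction of one bit time, about 4.8 %.
skew_fraction_of_bit = MAX_PAIR_SKEW_S / bit_time_s

# Number of bit times that fit in the maximum cable delay, about 12.5.
bits_in_flight = MAX_CABLE_DELAY_S / bit_time_s

if __name__ == "__main__":
    print(f"Bit time:             {bit_time_s * 1e9:.3f} ns")
    print(f"Microframe tolerance: {microframe_tolerance_s * 1e9:.1f} ns")
    print(f"Skew / bit time:      {skew_fraction_of_bit:.1%}")
    print(f"Bits in the cable:    {bits_in_flight:.2f}")
```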

Hopefully this article has given you an appreciation for the complexity and high quality of USB communications. Our modern computing world relies heavily on the USB protocol. While there are great benefits to isochronous transfers of USB data, there are pitfalls as well. For Audiophile USB to be of sufficiently high quality, strict attention needs to be paid to the reduction of noise and jitter. However, when successful, USB audio can be highly accurate.

Ken Matesich, 2016