Wednesday, November 11, 2015

PCIe 3.0 for beginners

The rapid adoption of PCI Express (PCIe) is delivering higher bandwidth to an ever-growing number of industry segments. With PCIe Gen2 now firmly established, PCIe Gen3, with its doubling of the effective data rate, is already on the launching pad.

This article focuses on some of the key physical-layer features and serializer/deserializer (SERDES) decisions that have made PCIe the success it’s become, then addresses the critical “curves ahead” designers will face with Gen3. In addition, it details the pros and cons of some of the critical physical-layer tradeoffs facing designers journeying into the world of PCIe Gen3.

AT THE PHYSICAL LAYER, WHAT IS PCI EXPRESS?

The PCIe physical layer has its genesis in Fibre Channel. In both, the eight bits of data presented for transmission are encoded into 10-bit symbols. Three functions follow from the adoption of 8b/10b encoding: limiting the maximum data run length, error detection, and limiting data wander.

Maximum run length: With 8b/10b encoding, the maximum number of bit intervals for which there can be no data transition is effectively five. Data transitions are critical for clock recovery. At the time of initial adoption, the typical clock-and-data-recovery mechanism was a voltage-controlled-oscillator (VCO)-based PLL design. Even with a distributed common reference, as is done in PCIe, long absences of data transitions leave the VCO of a PLL-based recovery circuit unable to correct for drift and maintain absolute data lock. If severe enough, this drift results in burst errors.
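
To make the run-length idea concrete, here’s a minimal Python sketch (purely illustrative; the function name and test pattern are our own, not from any specification) that scans a bit stream for its longest transition-free interval:

```python
def max_run_length(bits):
    """Return the longest run of identical consecutive bits.

    During a run there are no transitions, so this is the longest
    interval over which a PLL-based CDR receives no edge to track.
    """
    longest, current = 1, 1
    for prev, cur in zip(bits, bits[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest

# 8b/10b guarantees a worst case of five identical bits in a row:
assert max_run_length([0, 1, 0, 1, 1, 1, 1, 1, 0, 1]) == 5
```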

By ensuring a worst-case transition density, early SERDES designers had a means of guaranteeing clock recovery with PLL-based designs and could proceed with developing the physical link that became Fibre Channel and, later, PCIe. As loop techniques and low-leakage charge-pump technology improved, the stringency on run length became less critical. Moreover, as discussed later, the advent of delay-locked-loop (DLL) clock/data-recovery circuits, when operated with a distributed common reference clock, makes reliance on data run length even less of a concern.

Error detection—running disparity: In PCIe (as well as Fibre Channel), 8-bit values are encoded into 10-bit symbols. Symbols are further divided into two sub-classes, data and control. Control symbols allow physical-link actions, such as channel alignment, packet start, or skips; the Comma symbol is a familiar example. The data symbols are the encoded information. These symbols (data and control) can have a positive, negative, or zero disparity, where disparity is the count difference between the number of 1s and 0s in the symbol. By coding rule, the disparity of a valid symbol is limited to +2, −2, or zero. A +2 disparity means there are six 1s and four 0s in the symbol; the reverse for a negative disparity.

Disparity has several error-checking benefits. At the receiver interface, any symbol received with a disparity outside these bounds can be flagged as a potentially bad symbol. In addition, the protocol requires that if a symbol with positive disparity is sent at one instant, the next symbol must have negative or zero disparity; this is known as running disparity. This “running tab” helps in both error detection and signal integrity, as we’ll see next.
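
As a rough illustration of these rules, the sketch below computes symbol disparity and applies the alternation check. It’s a simplified model: a real 8b/10b decoder also validates each symbol against the code table, and the assumed starting running disparity is arbitrary.

```python
def disparity(symbol):
    """Count difference between 1s and 0s in a 10-bit symbol."""
    ones = sum(symbol)
    return ones - (len(symbol) - ones)

def check_stream(symbols):
    """Flag symbols that violate the disparity rules described above.

    Valid symbols carry a disparity of 0, +2, or -2, and a nonzero
    disparity must alternate in sign with the previous nonzero one
    (running disparity).
    """
    running = -1          # assume the link starts with negative running disparity
    errors = []
    for i, sym in enumerate(symbols):
        d = disparity(sym)
        if d not in (0, 2, -2):              # e.g., seven 1s and three 0s
            errors.append(i)
        elif d != 0:
            if (d > 0) == (running > 0):     # same sign twice in a row
                errors.append(i)
            running = 1 if d > 0 else -1
    return errors

# Seven 1s in ten bits -> disparity +4: flagged immediately.
print(check_stream([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]))   # [0]
```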

Data wander: PCIe is an ac-coupled protocol. Long sequences of unbalanced data disparity will result in data wander, which is a desensitization of the receiver due to the differential lines developing an offset.

To illustrate the effects of data wander on a differential receiver, a simple circuit is shown with two different pattern extremes, exaggerated to demonstrate the point (Fig. 1). The circuit is a simple, ideal buffer with a common mode of 2.5 V, ac-coupled into 50-Ω lines.

A series of plots shows the difference signal seen at the input of a 50-Ω terminated receiver under alternate data patterns (Figs. 2, 3, and 4). The bit rate is 2.5 Gbits/s (Gen1). As can be seen from the inflection at 33 µs in the composite figure, the pattern shifts from a purposely poorly dc-balanced input with long non-transition periods to one whose run length and 1/0 bit balance are similar to 8b/10b encoding (Fig. 2). Observing the zoom views, if the balance is poor with long periods of no bit change, the available difference signal at the receiver is reduced; here, the longest static period, repeated throughout the pattern, is 32 ns (80 bits) long (Fig. 3).

The zoom view of the balanced code shows that the differential signal at the receiver bears a close resemblance to the signal launched at the input (Fig. 4). In a real system, if no encoding (or a poor encoding scheme) were used, the signal at the receiver would suffer dc modulation. The effect is a reduction in channel margin, which becomes more significant with longer channels because of their higher losses. Also, as discussed later, data wander can present special problems to some adaptive receiver structures as they attempt to optimize to the incoming data pattern.
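
The effect can be approximated numerically. The sketch below models the ac-coupling network as a first-order high-pass filter; the time constant is deliberately chosen small (like the figure’s patterns, exaggerated to demonstrate the point) so the wander settles within a few thousand bits:

```python
def ac_couple(x, bit_time=0.4e-9, tau=100e-9):
    """First-order high-pass model of an ac-coupling network into a
    terminated receiver: y[n] = a*(y[n-1] + x[n] - x[n-1]).
    tau is unrealistically small, chosen to exaggerate the wander."""
    a = tau / (tau + bit_time)
    y, prev_x, prev_y = [], x[0], 0.0
    for xi in x:
        prev_y = a * (prev_y + xi - prev_x)
        prev_x = xi
        y.append(prev_y)
    return y

# Exaggerated, poorly balanced pattern: 70 ones, then 10 zeros.
bad = ([1.0] * 70 + [0.0] * 10) * 50
# Roughly 8b/10b-like balance: at most five identical bits in a row.
good = ([1.0] * 5 + [0.0] * 5) * 400

for name, pattern in (("unbalanced", bad), ("balanced", good)):
    y = ac_couple(pattern)
    print(name, "settled swing:",
          round(min(y[1000:]), 2), "to", round(max(y[1000:]), 2))
# The balanced pattern settles symmetrically (about -0.5 to +0.5).
# The unbalanced pattern drifts toward its dc average: the "1" level
# collapses to roughly +0.14 while the "0" level sits near -0.89,
# leaving almost no margin on one polarity at the slicer.
```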

OTHER PCIe PHYSICAL-LAYER FEATURES

Receiver detection: PCIe uses an ingenious means to recognize both the presence of a physical link and the channel width. The specification exploits the fact that an ac-coupled transmission line charges at a very different rate when the line is terminated versus open. Each PCIe transmitter, at the commencement of linkup, produces a low-frequency “ping” on each of the differential TX outputs. The transmitter includes a simple detection circuit to monitor the line’s response to this ping. With no receiver attached, the edge rate (and amplitude) of the line change is much higher than when a receiver is present. Because the specification defines ranges for both the coupling capacitance and the receiver termination, a distinct, detectable time-constant range determines whether a receiver is present.
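
The arithmetic behind the detection scheme is straightforward. The values below are illustrative only (the specification bounds the coupling capacitance and termination, not these exact numbers), but they show the orders-of-magnitude contrast the detector relies on:

```python
# Hypothetical values for illustration; the PCIe spec defines allowed
# ranges for the ac-coupling capacitor and the receiver termination.
C_AC   = 100e-9   # ac-coupling capacitor (on the order of 100 nF)
Z_TX   = 50.0     # transmitter output impedance per leg (ohms)
Z_RX   = 50.0     # receiver termination when present (ohms)
C_OPEN = 10e-12   # assumed stray capacitance when no receiver is attached

tau_present = (Z_TX + Z_RX) * C_AC   # charge through the termination
tau_absent  = Z_TX * C_OPEN          # only strays left to charge

print(f"receiver present: tau = {tau_present * 1e6:.1f} us")
print(f"receiver absent:  tau = {tau_absent * 1e9:.2f} ns")
# ~10 us versus ~0.5 ns: the ping's edge rate differs by four orders
# of magnitude, which is easy for a simple detector to distinguish.
```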

Inferred link idle: PCIe places a premium on overall system power. Links that don’t need to carry traffic for prolonged periods (i.e., hundreds of milliseconds) are allowed to quiet the link transmitters in an orderly fashion. With the link electrically quiet (no data transitions), the receive circuit senses the line and checks whether the incoming signal is above the approximately 100- to 175-mV EIDLE limit before allowing data to pass beyond the PHY layer. While this is an excellent power-saving feature, as we’ll see later in this article, longer channels and increased data rates mean that valid data could be present below this 175-mV threshold.

Because many of today’s Gen2 circuits and the Gen3 circuits of the not-too-distant future have receiver sensitivities significantly better than EIDLE, the specification now allows electrical idle and its exit to be inferred from a specific training-set sequence that high-quality receivers can detect. Doing so allows PCIe to operate over long/lossy links whose channel losses can easily reach 20 dB or more (while still saving power when the link isn’t needed).

ACK/NACK protocol: These physical-layer rules are part of another PCIe error-checking and correction mechanism. If an incoming packet has a receiver error (not necessarily a polarity error), such as a cyclic-redundancy-check (CRC) failure, the system issues what’s known as a packet NACK, which forces the transmitting device to resend. If the offending packet is NACKed three times, the link is forced to enter retraining. The ACK/NACK protocol is a powerful error-checking and correcting mechanism in PCIe that’s observed by each and every point-to-point link within a system.
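
Conceptually, the replay rule reduces to a small retry loop. The sketch below is not the actual data-link-layer state machine, just a schematic rendering of the three-NACKs-then-retrain behavior described above:

```python
def send_tlp(packet, transmit, max_naks=3):
    """Conceptual sketch of the PCIe replay rule described above:
    a NAKed packet is resent from the replay buffer; three NAKs
    force link retraining. `transmit` is a stand-in for the link,
    returning True for ACK and False for NAK (e.g., bad CRC)."""
    naks = 0
    while naks < max_naks:
        if transmit(packet):
            return "delivered"
        naks += 1               # receiver flagged an error; replay
    return "retrain link"       # repeated failure: drop into recovery

# Example: a flaky link that corrupts the first two attempts.
attempts = iter([False, False, True])
print(send_tlp("TLP", lambda p: next(attempts)))   # -> "delivered"
```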

De-emphasis: If the physical bandwidth of a given link is less than the required transmit bandwidth, then the signal’s frequency components are transported with unequal emphasis; that is, any signal component at the low end of the frequency spectrum is attenuated less than components in the higher portion of the spectrum. Unfortunately, few physical channels maintain a flat amplitude response out past 5 GHz (optical links are one example, but they have other, dispersive effects). The data bandwidth of high-speed protocols such as PCIe often exceeds the bandwidth of most lower-cost PCB materials and/or legacy implementations. However, it’s the ability to deliver higher performance while minimizing the necessity of “forklift” upgrades that’s a critical enabling factor for most backplane applications.

A max-loss approximation of a 50-cm, 100-Ω OIF channel is shown (Fig. 5). The channel-loss (S21) polynomial is defined in the ANSI FC Physical Interface specification. The model is first-order and not intended to show the reflection losses that further diminish the solution space for clean data delivery. As can be seen from the graph, frequency components at the higher end of the data spectrum are attenuated more than lower-frequency components. Two pulses of equal amplitude but significantly different pulse widths will arrive at the receiver with very different amplitudes.

A time-domain example of this frequency-dependent loss, together with reflections, is given in Figure 6. A PEX8648 reference design card is operated at 5 Gbits/s. The channel includes multiple PCIe connectors, a PCIe compliance card, an SMA-to-HM-Zd paddle card, and 75 cm of backplane.

To show how transmitter de-emphasis provides channel correction, a 5-Gbit/s signal with a boosted post cursor (PCIe de-emphasis) is launched into the backplane (Fig. 7). A second plot shows the signal at the end of the backplane (Fig. 8). The high- and low-frequency amplitudes are nearly matched after traversing the channel. By minimizing large variations in the instantaneous amplitude feeding the receiver, ISI, receiver harmonics, and amplitude-to-phase-noise jitter are each better managed, improving the resultant data eye.
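
In its simplest form, PCIe de-emphasis just sends any repeated bit at a reduced level, so transition bits carry relatively more high-frequency energy. Here is a small behavioral sketch (the 3.5-dB figure is the Gen1 de-emphasis value; the function itself is our own illustration):

```python
def de_emphasize(bits, db=3.5):
    """Apply PCIe-style post-cursor de-emphasis to an NRZ stream:
    the first bit after a transition is sent at full swing, and
    repeated bits at a level reduced by `db` decibels."""
    low = 10 ** (-db / 20.0)          # 3.5 dB -> about 0.668 of full swing
    out, prev = [], None
    for b in bits:
        level = 1.0 if b else -1.0
        out.append(level if b != prev else level * low)
        prev = b
    return out

print([round(v, 3) for v in de_emphasize([1, 1, 1, 0, 1, 0, 0])])
# [1.0, 0.668, 0.668, -1.0, 1.0, -1.0, -0.668]
```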

SO HOW DOES GEN3 STACK UP?

Conceptually, PCIe Gen3 is simply a doubling of the available data bandwidth over Gen2, from 4 to 8 Gbits/s per lane. Several key differences arise in achieving PCIe Gen3 data rates. To address them, we’ll look at several concepts: data rate versus line rate; pulse response; physical-layer architectural decisions; test needs; legacy issues; and equalization architectures and tradeoffs.

Data rate vs. line rate: As discussed previously, PCIe takes eight bits of data and adds two bits of encoding. This results in a 20% overhead in line transmission. The 2.5-Gbit/s line rate of Gen1 has a base data delivery rate of 2 Gbits/s (excluding packet protocol data-link-layer overhead). The PCIe Gen2 5-Gbit/s line rate has a base delivery of 4 Gbits/s. So, for Gen3 to effectively double the speed of Gen2, it’s necessary to deliver a base data rate of 8 Gbits/s.

To do so, Gen3 removes 8b/10b encoding and consequently recovers most of the 20% encoding overhead. As data rates increase, the premium on usable channel bandwidth also goes up. The rationale for adoption is similar to that of other protocols looking to reduce overhead (e.g., 64b/66b encoding in 10-Gigabit Ethernet and 10-Gbit Fibre Channel).
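
The payload arithmetic is worth seeing side by side. Using the 128-bit-payload-plus-two-sync-bit scheme expected for Gen3 (see the next section), the effective rates work out as follows:

```python
# Effective per-lane data rate = line rate x encoding efficiency.
rates = {
    "Gen1 (8b/10b)":    (2.5e9, 8 / 10),
    "Gen2 (8b/10b)":    (5.0e9, 8 / 10),
    "Gen3 (128b/130b)": (8.0e9, 128 / 130),
}
for name, (line_rate, efficiency) in rates.items():
    print(f"{name}: {line_rate * efficiency / 1e9:.2f} Gbits/s payload")
# Gen1: 2.00, Gen2: 4.00, Gen3: 7.88 -- nearly double Gen2, without
# having to push the line rate all the way to 10 Gbits/s.
```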

However, with these changes come the concerns of clock recovery, data wander, harmonic suppression, and dc balance discussed previously. Using the 50-cm OIF channel model as an example, the loss difference between 4 GHz and 5 GHz (the Nyquist frequencies of an 8-Gbit/s scrambled line rate and the 10 Gbits/s that retaining 8b/10b would have required) is minimally an additional 5 to 6 dB that the SERDES would need to resolve. Similarly, with increased speeds comes the likelihood of channel resonances due to discontinuities. These resonances result in additional “dips” in the channel response, which cause distortion and require compensation.

Data encoding: Rather than 8b/10b, Gen3 will employ an alternate encoding with a worst-case non-transition period on par with alternate schemes such as 64b/66b encoding. (The actual expectation is 128 payload bits carried behind a two-bit sync header.) While something on the order of 64 to 128 bits will be the likely worst-case non-transition period, some means of data scrambling will also be employed.

Without this scrambling, it’s easy to envision data streams (e.g., static, uncompressed video) for which ac-coupling and clock recovery could become a problem. With scrambling, the chances of long runs approaching 64 to 128 bits are significantly reduced. The keys for whatever scrambling polynomial(s) are finally chosen include the ability to maintain dc balance, ease of synchronization, and EMI suppression/balance (see the previous section for the extreme effects of poor dc balance and transition density).
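
For illustration, here’s a minimal additive LFSR scrambler. The tap set shown corresponds to the x^23 + x^21 + x^16 + x^8 + x^5 + x^2 + 1 polynomial ultimately adopted for Gen3’s per-lane scrambler; the seed and the surrounding framing are our own simplifications:

```python
def lfsr_scramble(bits, taps=(23, 21, 16, 8, 5, 2), seed=0x1DBFBC):
    """Additive LFSR scrambler sketch: XOR each data bit with the
    feedback stream of a 23-bit Fibonacci LFSR. Tap positions match
    x^23+x^21+x^16+x^8+x^5+x^2+1; the seed is arbitrary."""
    state = seed & 0x7FFFFF            # 23-bit shift register
    out = []
    for b in bits:
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1    # XOR the tapped register bits
        out.append(b ^ fb)                  # whiten the data bit
        state = ((state << 1) | fb) & 0x7FFFFF
    return out

# Even a pathological all-zeros payload emerges pseudo-random:
scrambled = lfsr_scramble([0] * 64)
print(sum(scrambled), "ones in 64 scrambled zero bits")  # roughly half
```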

Pulse response analysis: A heuristic discussion of pulse response serves as an aid to visualizing the concepts of equalization. Pulse response is an alternate means of channel observation: while S-parameters look at the channel in the frequency domain, pulse response quantifies the channel in time. An example of what an 8-Gbit/s pulse response might look like entering, then leaving, an arbitrary channel is shown in Figure 9.

At successive time samples, the input pulse has a single non-zero value at one sample instant (0.125 ns) and zero at successive sample points. The output pulse is expected to have some arbitrary delay commensurate with the propagation delay of the channel. If the channel were ideal, with minimal reflections as well as wide bandwidth and low loss, the resultant output pulse would look much like the input pulse, only shifted in time.

It’s clear that, despite a clean input, bandwidth limitations, losses, and discontinuities create pulse spreading (intersymbol interference, or ISI) such that the output response isn’t single-valued. The job of equalization is to make the output response as close to single-valued (only one non-zero sample point) as possible.

As a visual aid to the potential impact of discontinuities, a 2.5-Gbit/s simulated eye diagram of an ideal transmitter connected to a 6-in. transmission line is shown (Fig. 10). Within the transmission-line model, a lumped capacitance is used as a simple via approximation and placed approximately one-third of the way along the 6-in. trace (Fig. 11). Alternate simulations show the effect of an additional PCB via and intentional impedance mismatching (Fig. 12).

Used as a simple visual aid, these 2.5-Gbit/s simulations exhibit how discontinuities create ISI and reduce eye quality in both the horizontal and vertical axes. Channels that behave this way present significant challenges to high-speed data recovery. (We’ll return to pulse response when we discuss the various receiver architectures.)

DATA TRANSMISSION AND RECOVERY: ARCHITECTURAL DECISIONS

DLL vs. PLL: Data pattern and anticipated channel behavior play important roles in defining both the transmitter and receiver design methodologies. For example, the change in data encoding for Gen3 results in longer run lengths with no transitions, which affects data recovery. Whereas DLL-based clock/data-recovery methods (operated with a common base reference) are well suited to operating through long transition-less periods, this condition represents a potential strain on PLL-based data-recovery schemes and requires pre-design verification.

Describing the two architectures: a DLL-based recovery scheme compares the edge of the incoming data stream to one of several quantized phases of a fixed reference. At each data transition, a decision is made as to whether the clock phase or the data edge arrived first (early/late detection). The early and late counts are accumulated and, after a predetermined interval (the counter is effectively the DLL bandwidth), a decision is made to either advance or retard the recovered clock phase.

The key is that if no data transitions occur for an extended period, the bandwidth counter doesn’t advance and the clock phase remains unchanged. When the transmitter and receiver at the two ends of a link operate from a distributed base reference (common in PCIe), CDR tracking can effectively be turned off once data alignment is achieved, without incurring data loss.
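
In pseudocode form, the early/late accumulation looks something like the sketch below (a schematic model, not any vendor’s implementation). Note how transition-free intervals simply fail to advance the counter:

```python
def cdr_update(votes_stream, window=16):
    """Sketch of the early/late vote accumulation described above.
    `votes_stream` yields +1 (clock early), -1 (clock late), or
    0 (no data transition, so no vote). The window sets the
    effective DLL bandwidth; a long transition-free run casts no
    votes and leaves the recovered phase where it is."""
    phase_steps = 0
    votes = seen = 0
    for v in votes_stream:
        if v == 0:
            continue              # no edge: the counter does not advance
        votes += v
        seen += 1
        if seen == window:
            if votes > 0:
                phase_steps += 1  # advance the recovered clock phase
            elif votes < 0:
                phase_steps -= 1  # retard the recovered clock phase
            votes = seen = 0
    return phase_steps

# One advance from the early votes; the long idle gap contributes nothing.
print(cdr_update([+1] * 20 + [0] * 100 + [-1] * 4))   # -> 1
```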

In contrast, with this same reference-clock distribution, a PLL architecture requires spectral content from the data to maintain phase alignment with the incoming reference clock. If no data transitions are present, the charge pump of the PLL will attempt to hold its value; unlike the DLL design, however, it has a finite time constant. Still, with advances in low-leakage design, PLL stability can span tens of bits to potentially more than 100 transition-free bit intervals.

That said, when either of these architectures is operated with asynchronous clocking, both will have a tracking time constant. Thus, it will be important to evaluate each architecture’s ability to acquire and track under the anticipated data densities, as well as its maximum frequency offset.

Transmitter emphasis: If a given channel is continuous but of low bandwidth, data transmission can be improved with more transmitter emphasis to balance the high-frequency losses at the receiver. While post-cursor emphasis has been defined in both PCIe Gen1 and Gen2 (and, eventually, in Gen3), it’s very likely that the best transmitter designs will incorporate something more.

Building off the efforts of 10-Gbit OIF, PCIe Gen3 will likely incorporate a transmit structure capable of both pre- and post-cursor emphasis. (Pre-cursor and post-cursor are becoming the more proper terms for defining transmitter emphasis.) Such a structure allows more control and pulse shaping of the launched signal, and more flexibility to mitigate crosstalk and reflections in legacy channels. (Refer back to the output response of Figure 9: the non-zero value at 0.125 ns represents the pre-cursor response, while the values at 0.375 and 0.500 ns represent successive post-cursors.)

In an example three-tap (one pre, one post) transmitter, the tap numbers -1, 0, and +1 refer to the relative position of the signal feeding the summing junction, where Tap(0) is the main (cursor) bit value (Fig. 13). The gain of each tap is independently adjusted; typically the main tap has a large value (approaching “1”) and the side taps are adjustable negative fractional values that subtract from the main tap. Through the investigative work of Liu and others on the 10GBASE-KR equalization specification, pre-cursor adjustability has been shown to be particularly beneficial to the DFE receive-equalization scheme discussed below. Time representations of PCIe de-emphasis (post cursor) (Fig. 14) and of combined pre- and post-cursor emphasis (Fig. 15) are provided.
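
A behavioral model of such a three-tap FIR takes only a few lines. The tap weights here are illustrative placeholders, not values from any specification:

```python
def tx_fir(bits, c_pre=-0.1, c_main=0.8, c_post=-0.1):
    """Three-tap (one pre, one post) transmit FIR sketch. Taps -1/0/+1
    correspond to the next, current, and previous bit feeding the
    summing junction; real designs bound |c_pre|+|c_main|+|c_post|
    by the available output swing."""
    levels = [1.0 if b else -1.0 for b in bits]
    padded = [levels[0]] + levels + [levels[-1]]   # hold values at the edges
    out = []
    for i in range(1, len(padded) - 1):
        out.append(c_pre  * padded[i + 1] +   # pre-cursor: next bit
                   c_main * padded[i] +       # main cursor: current bit
                   c_post * padded[i - 1])    # post-cursor: previous bit
    return out

print([round(v, 2) for v in tx_fir([0, 0, 1, 1, 0])])
# [-0.6, -0.8, 0.8, 0.8, -0.8]: bits adjacent to a transition get
# extra swing; repeated bits away from transitions are de-emphasized.
```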

Receiver options—decision feedback equalization (DFE): DFE is a powerful tool for legacy channel management (Fig. 16). DFE can’t improve every possible channel. However, many channels with insufficient bandwidth and/or reflections severe enough to leave no discernible eye opening at the receiver can be made to operate robustly.

Going back to the earlier pulse-response figure (Fig. 9, again), a DFE functionally operates on the pulse-response principle: the ideal response entering the receive slicer should have a single non-zero entry at the main sample point and “zero” entries at the samples around the main value. The circuit consists of a multi-tap finite-impulse-response filter whose input and output are looped around the receive data slicer. Adaptation algorithms such as LMS are then used to adjust the individual tap weights toward the ideal response discussed above.

While the circuit, by virtue of the feedback from the slicer, meets the classical DSP definition of IIR (the output feeds back into the input), it’s often considered FIR because the span of control over which the DFE can modify the pulse response is limited to the number of taps in the circuit. This differs from the next receiver structure we’ll address, the CTLE. The advantage of the DFE, however, is that within its span of control, the adjustment for discontinuities is much more granular and independent from tap to tap.
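
Behaviorally, a DFE is a slicer plus a feedback sum over its last few decisions. The sketch below uses fixed, illustrative tap weights; a real design would adapt them (e.g., via LMS) as described above:

```python
from collections import deque

def dfe(samples, taps=(0.25, 0.12, 0.05)):
    """Three-tap DFE sketch. Each tap weight models the residual
    post-cursor ISI that one past decision contributes at the
    current sample; the weights here are illustrative. Because past
    *decisions* (not noisy samples) are fed back, channel noise is
    not amplified -- but a wrong decision can propagate as a burst
    error, which the PCIe replay scheme helps to mitigate."""
    history = deque([0.0] * len(taps), maxlen=len(taps))  # newest first
    decisions = []
    for x in samples:
        # Subtract the ISI predicted from the last N decisions.
        corrected = x - sum(w * d for w, d in zip(taps, history))
        d = 1.0 if corrected > 0 else -1.0                # slicer
        decisions.append(d)
        history.appendleft(d)
    return decisions
```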

Figure 17, a pulse-response example, illustrates DFE principles. The circuit is a three-tap DFE with a presumed bit rate of 8 Gbits/s. The orange arrow represents the pulse-response shaping caused by the pre-cursor emphasis of the three-tap transmitter discussed earlier, not by the receiver. The three green arrows span the range (in time) of discontinuity control that each of the three independent RX taps may possess. Note that the fourth pulse (at t = 0.625 ns) shows no improvement: because this DFE has only three taps, it’s not possible to deterministically control any discontinuity outside a 0.375-ns (3-UI, or three-tap) window from the main pulse.

One other point of note about pulse responses: intermediary non-zero excursions of the response that aren’t located on the baud lines don’t adversely affect signal quality. With a baud-rate equalizer, what matters is maximizing the energy of the main cursor and minimizing the energy at the pre- and post-cursors at each instant a sample is taken.

DFE structures have the following advantages and disadvantages:

Advantages:
Autonomous adjustment
Independent control of each tap within the circuit
No magnification of channel noise

Disadvantages:
Burst errors/error propagation (mitigated by the PCIe replay scheme)
Finite control range that expands only linearly with the number of taps (size/power)
Adjusts the post-cursor response only
Difficulty converging in the presence of significant data wander

It’s the ability of a properly functioning DFE circuit to autonomously adjust to the characteristics of the channel that makes it so powerful. But just as with any high-end vehicle, performance comes at a cost: power. DFE can add anywhere from 15% to 30% to the overall SERDES power of an optimized PCIe design. While the PCIe Gen3 specification doesn’t require DFE, many high-performance SERDES will likely employ such a structure. Depending on size and power, the five-tap DFE common in the OIF community may well become commonplace in PCIe, too.

Receiver options—fixed linear equalization: Linear equalization poses an attractive alternative to purely digital forms due to its lower power consumption. Because PCIe places such a high importance on low power, linear options have been attractive in Gen2.

Likewise, Gen3 devices that aim for the absolute lowest power and size are likely to opt for linear equalization only. In well-designed new applications, such devices can provide solid performance. The open questions will be their applicability to servicing today’s legacy backplanes, and the tradeoff between channel design and silicon cost. The evolution of linear equalization can be seen in a simple channel-equalization scheme (Fig. 18).

This typical passive-line equalizer operates by reducing the overall signal amplitude. Energy in the high-frequency band passes through the capacitor with minimal loss, while low-frequency components see additional attenuation (R), producing an overall amplitude balance that counterbalances the high-frequency roll-off of typical channels. This conceptually mimics the functionality of PCIe transmitter de-emphasis (post-cursor) equalization.

To better visualize the single-pole response of such a circuit and how it modifies the signal spectrum, Figure 19 depicts a frequency plot for several potential filter settings for a 5-Gbit/s link. Channel insertion loss is 6 dB. Signal components below 2.5 GHz undergo varying amounts of additional attenuation, while components near 2.5 GHz (the 5-Gbit/s fundamental) pass through at the base insertion loss.

Going back to the earlier discussion of the channel, where the higher frequencies are more attenuated: the filter attempts to implement the inverse response so that the net output is flat across frequency. Typical of many Gen2 PCIe devices, several stages of these filter circuits can be manually switched in by the user.
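
The shelf response is easy to compute directly. In the sketch below, a series R bypassed by a capacitor C drives a matched termination; the component values are illustrative, not taken from any PCIe design:

```python
import math

def shelf_gain_db(f, R=150.0, C=2e-12, Rt=50.0):
    """|H| in dB for the passive equalizer described above: a series
    resistor R bypassed by capacitor C, driving a termination Rt.
    Low frequencies see the R-Rt divider (about -12 dB with these
    values); high frequencies pass through C nearly unattenuated."""
    w = 2 * math.pi * f
    z_series = R / complex(1, w * R * C)   # R in parallel with 1/(jwC)
    h = Rt / (Rt + z_series)
    return 20 * math.log10(abs(h))

for f in (1e8, 1e9, 2.5e9, 5e9):
    print(f"{f / 1e9:4.1f} GHz: {shelf_gain_db(f):6.2f} dB")
# The low end is attenuated while the band near the data fundamental
# passes nearly intact: an inverse of the typical channel roll-off.
```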

Adjustable linear equalizers: Notice that the impulse response of this network is an RC decay that extends to infinity; it has what’s known as an “infinite impulse response.” Adaptive linear filter options differ in placing the linear element either inside or outside the decision-feedback loop. In a continuous-time linear equalizer (CTLE), the linear element is placed before the slicer. The receiver structure is conceptually similar to the passive-line equalizer, but allows a range of impedance and gain control (a variable R, for example) so that the filter and/or corner frequency, as well as the overall amplitude, can be adjusted automatically. Proprietary algorithms perform the adjustment by observing the data before and after the slicer; comparing the filtered response to the sliced response can drive the adaptation.

A CT-DFE places the linear element in parallel with the slicer. This structure has issues similar to the CTLE’s, with limited control/adjustability. However, it doesn’t amplify noise, and it’s often found as a “prefilter” in DFE designs to remove data wander.

So what might an adaptive linear response look like compared to DFE (Fig. 20)? The linear adaptive filter has a smooth, decaying response that, like the classical capacitor’s, continues to infinity. If the channel response is also smooth and continuous (i.e., coaxial cable), the circuit can match the channel very well.

In this example, the CTLE circuit attempts to “curve fit” each discontinuity, including the perturbation at 0.625 ns. The three-tap classical DFE discussed earlier would not have a “control lever” to minimize this fourth sample point. If the actual pulse response were that of a uniform channel, such as cable, coax, or a backplane with minimal discontinuities, a first- or second-order CTLE can often be as effective as a five- to seven-tap DFE.

The problem is that most designs aren’t uniform. While imperfections such as via discontinuities, connectors, coupling, and trace variability may have been sufficiently planned for in yesterday’s channels, they now become “in-band” perturbations at today’s speeds. Because these discontinuities can be found at any point along the channel, the advantage of the DFE is the independent adjustment for “in-band” perturbations.

Linear adaptive equalization has the following advantages and disadvantages:

Advantages:
Low power
Minimal added space
Addresses both pre- and post-cursor ISI (CTLE)
Very effective for continuous, band-limited channels such as cables and short links
Filter path can be cascaded for a higher-order response

Disadvantages:
Can result in noise and crosstalk amplification (CTLE)
Performs poorly on highly discontinuous channels
Signal adjustment interacts with every discontinuity, without independent adjustability

CHANNEL IMPEDANCE AND TESTABILITY

One PCIe physical-layer change considered sacrilegious by many RF designers was the Gen1-to-Gen2 decision to change the channel impedance to 85-Ω differential. A 100-Ω characteristic impedance had long been a standard design criterion for high-speed buffer/SERDES designers and test-equipment manufacturers alike.

The PCIe committee recommended the change to 85 Ω based on empirical simulation analysis of channels employing the present PCIe connector. In moving to 85 Ω, it was determined that more channels would interact better with the PCIe connector and pass PCI-SIG mask requirements. All indications are that the Gen3 PCIe connector will minimally keep the same form factor, if not the same connector impedance, meaning a similar resultant channel impedance of 85 Ω will be the likely recommendation. Channel compliance will likely be based on a complete S-parameter model that includes the TX and RX buffers, package, connectors, and channel.

In practical terms, impedance mismatches will exist, and most users and systems will rely on the ability to adjust device impedance to minimize/absorb reflections in the transmission line. As such, it’s very likely that buffer impedance tuning (both transmit and receive) will be part of any practical design.

Furthermore, at 8 Gbits/s, eye quality becomes difficult to observe by simple, direct measurement. Package interconnect parasitics alone can be significant enough to prevent correlating signal quality at the ball of the chip with the quality at the physical buffer inside. For that reason, a common SERDES vendor requirement should be the ability to perform receiver eye measurements within the device, giving the user an understanding of how much signal margin is available in the presence of all channel perturbations.

The most effective detection circuits will measure both the vertical and horizontal eye margins. Similarly, to verify the complete end-to-end channel, bit-error-rate testers and/or generators will be incorporated in these devices. While this capability has been common in standard SERDES for some time, it’s likely that the PRBS pattern options will increase to accommodate/validate channel capability under longer run lengths. Typical patterns such as PRBS7 have a data-transition density similar to 8b/10b encoding. With Gen3, it’s likely that patterns such as PRBS24 to PRBS31 will be needed to provide a more representative channel stimulus.
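
For reference, a PRBS7 generator is only a few lines. The polynomial x^7 + x^6 + 1 is the standard one for PRBS7; everything else here (seed, framing) is our own illustration:

```python
def prbs7(nbits, seed=0x7F):
    """PRBS7 generator sketch, polynomial x^7 + x^6 + 1. Its 127-bit
    period has a transition density roughly like 8b/10b traffic;
    longer polynomials (e.g., PRBS31) stress a channel with the
    longer runs expected of scrambled Gen3 data."""
    state = seed & 0x7F
    out = []
    for _ in range(nbits):
        fb = ((state >> 6) ^ (state >> 5)) & 1   # taps at bits 7 and 6
        out.append(state & 1)
        state = ((state >> 1) | (fb << 6)) & 0x7F
    return out

bits = prbs7(254)
print(sum(bits), "ones in 254 bits")   # 128: 64 ones per 127-bit period
```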

SUMMARY

The PCIe community will continue to place significant pressure on SERDES designers to minimize power and space, as well as to deliver cost-efficient, high-performance designs. As stated earlier, PCIe Gen3 will make legacy-channel functionality possible at 8 Gbits/s per lane. The best of these devices will incorporate both linear equalization and DFE, similar to what the best OIF/FC devices provide today.

One thing is certain: PCIe is rapidly becoming the de facto I/O standard in many of today’s systems. With the success of the power-conscious, feature-rich, yet highly efficient I/O proven in Gen2, and the sheer performance promise of a Gen3 x16 link, PCIe is here to stay.
