ABPS

A simple analytical model

High prediction accuracy is vital, especially for multiple-instruction-issue processors. In this section we adopt the analytical model proposed in [2] for a superscalar processor; the model ignores stalls such as cache misses and bus conflicts and focuses only on the penalty introduced by branch misprediction. Let the Branch Penalty (BP) be the average number of cycles wasted due to branch mispredictions per dynamic instruction. The following relation can be written:

BP = C·(1-Ap)·b·IR [wasted clock cycles / instruction] (4)

Where we denoted:

C = number of penalty cycles wasted due to a branch miss-prediction;
Ap = prediction accuracy;
b = the ratio of the number of branches to the total number of instructions;
IR = the average number of instructions that are executed per cycle.
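
As a quick sanity check of formula (4), a minimal Python sketch (illustrative only), using the example values quoted later in this section (C = 7, Ap = 0.936, b = 8%, IR = 4), could look like this:

    # Branch penalty, formula (4): BP = C * (1 - Ap) * b * IR
    def branch_penalty(C, Ap, b, IR):
        return C * (1 - Ap) * b * IR

    # Values quoted in this section: C = 7 penalty cycles, Ap = 0.936,
    # b = 0.08 (branch ratio), IR = 4 (ideal issue rate)
    print(branch_penalty(C=7, Ap=0.936, b=0.08, IR=4))  # ~0.143 wasted cycles / instruction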

We now compute how many cycles the execution of each instruction takes on a real superscalar processor that includes a branch predictor:

CPIreal = CPIideal + BP [clock cycle / instruction] (5)

Where:

CPIideal represents the average number of cycles per instruction assuming perfect branch prediction (Ap = 100% => BP = 0). It is obvious that CPIideal < 1 for a superscalar processor.

CPIreal represents the average number of cycles per instruction considering a real branch predictor (Ap < 100% => BP > 0 => CPIreal > CPIideal).

Therefore, the real processing rate (the average number of instructions executed per cycle) results immediately from the formula:

IRreal = 1 / CPIreal = 1 / (CPIideal + BP) [instructions / clock cycle] (6)
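
A minimal Python sketch of the chain (4)-(6), assuming CPIideal = 0.25 (i.e. the IRideal = 4 baseline used later in this section), illustrates how IR responds non-linearly as Ap improves:

    # Formulas (5) and (6): CPIreal = CPIideal + BP, IRreal = 1 / CPIreal
    def ir_real(cpi_ideal, bp):
        return 1.0 / (cpi_ideal + bp)

    cpi_ideal = 0.25                     # IRideal = 4 => CPIideal = 1/4
    bp_a = 7 * (1 - 0.936) * 0.08 * 4    # formula (4), Ap = 93.6%
    bp_b = 7 * (1 - 0.968) * 0.08 * 4    # formula (4), misprediction rate halved
    print(ir_real(cpi_ideal, bp_a))      # ~2.54 instructions / cycle
    print(ir_real(cpi_ideal, bp_b))      # ~3.11, i.e. IR does not scale linearly with Ap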

Relation (6) shows the non-linear correlation between the processing rate (IR) and the prediction accuracy (Ap). With these metrics, we adapted the model to our results presented in this section. Further, we use the following notations:

x = the ratio of biased context instances

1 - x = the ratio of unbiased context instances

Since Apglobal represents a weighted mean of the prediction accuracies of biased and unbiased branches, the prediction accuracy on the biased branches, Apbiased, can be determined:

Apglobal = x·Apbiased + (1-x)·Apunbiased (7)

Let's consider the following example:

Apglobal = 0.936 = 0.8253·Apbiased + 0.1747·0.722 (where x = 0.8253)
It results that Apbiased = 0.9813.
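
The same computation, i.e. formula (7) solved for Apbiased, can be sketched in Python (illustrative only):

    # Formula (7) solved for Apbiased:
    # Apbiased = (Apglobal - (1 - x) * Apunbiased) / x
    def ap_biased(ap_global, ap_unbiased, x):
        return (ap_global - (1 - x) * ap_unbiased) / x

    print(ap_biased(ap_global=0.936, ap_unbiased=0.722, x=0.8253))  # ~0.9813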

Obviously, predicting the unbiased branches with a more powerful branch predictor, having for example 95% prediction accuracy, would yield an accuracy gain: Accuracy_gain = (0.95 - 0.722)·(1 - x). Moreover, this accuracy gain translates into a processing-rate speed-up according to formulas (4) and (6). This gain justifies the importance and the necessity of finding and solving the difficult-to-predict branches.
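
Numerically, Accuracy_gain = (0.95 - 0.722)·(1 - 0.8253) ≈ 0.04, i.e. about four percentage points of global accuracy. A minimal Python sketch, reusing formulas (4)-(6) with the baseline parameters given below (C = 7, b = 8%, IRideal = 4), suggests that under this model the corresponding IR speed-up would be roughly 29%:

    # Hypothetical scenario: unbiased branches predicted with 95% accuracy
    x, ap_unb_old, ap_unb_new = 0.8253, 0.722, 0.95
    accuracy_gain = (ap_unb_new - ap_unb_old) * (1 - x)   # ~0.0398 (about 4 percentage points)

    def ir_real(ap_global, C=7, b=0.08, ir_ideal=4.0):
        bp = C * (1 - ap_global) * b * ir_ideal           # formula (4)
        return 1.0 / (1.0 / ir_ideal + bp)                # formulas (5) and (6)

    ir_old = ir_real(0.936)
    ir_new = ir_real(0.936 + accuracy_gain)
    print((ir_new - ir_old) / ir_old)                     # ~0.29, i.e. a ~29% IR speed-up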

Therefore, we further determined how much the branch penalty (BP) is influenced by increasing the context length, and what speed-up results under these conditions. For this, we slightly modified the initial model proposed by Chang [2] by substituting Ap with our Apglobal, according to formula (7). Thus, the model charges a penalty for the misprediction of biased branches (the term (1 - Apbiased)·x) and for all unbiased branches, which are considered wrongly predicted (Apunbiased = 0, giving the term (1 - x)).

Model proposed by Chang [2]:  BP = C·(1-Ap)·b·IR
Our “modified” model:         BP = C·b·IR·[1 - x·Apbiased] (8)
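
A minimal Python sketch of the modified model (8), using the example values from this section (x = 0.8253, Apbiased = 0.9813, C = 7, b = 8%, IR = 4):

    # Formula (8): BP = C * b * IR * (1 - x * Apbiased); unbiased branches
    # are charged as always mispredicted (Apunbiased = 0).
    def branch_penalty_modified(C, b, IR, x, ap_biased):
        return C * b * IR * (1 - x * ap_biased)

    print(branch_penalty_modified(C=7, b=0.08, IR=4, x=0.8253, ap_biased=0.9813))  # ~0.43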

In our previous research [3] we showed that extending the context length decreases the ratio of unbiased branches (1 - x), which leads to a reduction of the branch penalty (BP) according to (8), and implicitly to a greater IR according to (6). The following chain of relations can be written (↑ means a growth, ↓ a reduction):

Context (Feature Set) Length ↑ => x ↑ => BP ↓ => IR ↑ => Relative Speedup > 0

Next, we computed the IR relative speedup when varying the context length. Starting from the well-known metric IR(L) = 1 / CPIreal(L), where L is the feature set length, L ∈ {20, 24, 28}, we obtained the relative speedup:

Relative Speedup: S(L) = [IR(L) - IR(16)] / IR(16)

The baseline processor model has an IRideal of 4 [instructions / clock cycle]. The number of penalty cycles wasted due to a branch misprediction considered in our model is 7. The ratio of the number of simulated branches to the total number of simulated instructions is b = 8% (obtained from our previous simulations [3]).

Figure 1. The IR relative speedup obtained by growing the context length

Figure 1 illustrates not only the need for a greater number of prediction features to improve the processor's performance, but also the need for new, better-performing branch predictors that can take into account a larger amount of information when generating predictions (without scaling exponentially with the length of the input feature set).

The analytical model described above is implemented by ABPS.
For each simulated configuration, ABPS computes the branch penalty (BP), the real processing rate (IRreal) and the Relative Speedup (S).

Note: the speedup is computed relative to the first configuration. Thus, the first configuration represented will always have zero speedup.

ABPS computes the above metrics using the following mathematical background:
BP = C·(1-Ap)·b·IR (4)
CPIreal = CPIideal + BP (5)
IRreal = 1 / CPIreal (6)
Relative Speedup: S(L) = [IR(L) - IR(16)] / IR(16)
Note: The Relative Speedup (S) represents the Issue Rate gain relative to a specified configuration (above, IR(16) means the IR for a configuration with a 16-bit history length; L can be, in this case, 20, 24 or 28, as in Figure 1).
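
For illustration only, a minimal Python sketch of how these metrics could be recomputed outside ABPS; the x values for the 20, 24 and 28-bit history lengths are hypothetical placeholders (only x = 0.8253 and the parameters C = 7, b = 8%, IRideal = 4, Apbiased = 0.9813 come from this section):

    # Illustrative recomputation of the metrics reported by ABPS: BP (formula 8),
    # IRreal (formulas 5 and 6) and the relative speedup S(L) versus the
    # 16-bit history configuration.
    C, B, IR_IDEAL, AP_BIASED = 7, 0.08, 4.0, 0.9813

    # Hypothetical ratios of biased context instances per history length;
    # only x(16) = 0.8253 is quoted in this section, the rest are placeholders.
    x = {16: 0.8253, 20: 0.86, 24: 0.89, 28: 0.91}

    def ir_real(x_l):
        bp = C * B * IR_IDEAL * (1 - x_l * AP_BIASED)   # formula (8)
        return 1.0 / (1.0 / IR_IDEAL + bp)              # formulas (5) and (6)

    ir16 = ir_real(x[16])
    for length in (16, 20, 24, 28):
        speedup = (ir_real(x[length]) - ir16) / ir16    # relative speedup vs. 16 bits
        print(length, round(ir_real(x[length]), 3), round(100.0 * speedup, 2))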

References:

[1] Vinţan L., Prediction Techniques in Advanced Computing Architectures (in English), MatrixRom Publishing House, Bucharest, ISBN 978-973-755-137-5, 2007.

[2] Chang P.-Y., Hao E., Yeh T.-Y., Patt Y. N., Branch Classification: A New Mechanism for Improving Branch Predictor Performance, Proceedings of the 27th International Symposium on Microarchitecture, San Jose, California, 1994.

[3] Vinţan L., Gellert A., Florea A., Oancea M., Egan C., Understanding Prediction Limits through Unbiased Branches, Eleventh Asia-Pacific Computer Systems Architecture Conference, Shanghai, 6-8 September 2006 (Lecture Notes in Computer Science, vol. 4186, pp. 480-487, Springer-Verlag, Berlin / Heidelberg, 2006).