Why random rotations are good for RoPE


[Figure: random RoPE]

Jerry Xiong created the above picture, which started these findings. With axial RoPE (Rotary Positional Embedding), positionally attending to a single pixel causes incidental attention to a cross, because RoPE’s sine waves align. This is well-known. But he discovered that with random RoPE rotations, the cross disappears.

Random-rotation RoPE is better than your current 2D axial RoPE, and Jerry’s blog post has an even better version. Jerry’s preferred explanation is that covering angles uniformly is the goal.

However, my preferred explanation is that incoherent angles are the goal. So this blog post is an appendix, describing an optimality condition for RoPE, the math behind this RoPE variant, and where incoherence arises.

Focus on a single token

When comparing the attention operation to other architectures, the main talent of attention is its ability to attend to single keys (needle-in-a-haystack retrieval). If you disagree, then either your understanding of ML architecture is poor, or you know too much mechinterp.

Accordingly, the optimal RoPE should positionally focus on a single token. Let’s work out the math.

Notate the center of an image as complex $0+0i$, which RoPE is attending to. With $k \in \{1, 2, ..., n\}$ as indices, let each complex RoPE term be $c_k = e^{i \theta_k} r_k$, where $\theta_k$ is the angle and $r_k$ is the frequency’s magnitude. The attention score at a point $z$ is $\sum_k \cos(z \cdot c_k)$, where the dot is the real-plane dot product, considering the complex plane as $\mathbb R^2$.
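As a minimal sketch of the score just defined, the snippet below evaluates $\sum_k \cos(z \cdot c_k)$ directly. The function names and the four-channel frequency list are hypothetical, chosen only for illustration; they do not come from any RoPE library.

```python
import math

def rope_terms(thetas, freqs):
    """Return each RoPE term c_k = r_k * e^{i theta_k} as a real (x, y) vector."""
    return [(r * math.cos(t), r * math.sin(t)) for t, r in zip(thetas, freqs)]

def attention_score(x, y, terms):
    """Positional attention score sum_k cos(z . c_k) at the point z = x + iy."""
    return sum(math.cos(x * cx + y * cy) for cx, cy in terms)

# Hypothetical example: 4 channels with frequencies 1, 2, 4, 8.
freqs = [1.0, 2.0, 4.0, 8.0]
# Axial RoPE: channels alternate between the x direction and the y direction.
axial = rope_terms([0.0, math.pi / 2, 0.0, math.pi / 2], freqs)

center = attention_score(0.0, 0.0, axial)   # every cosine is cos(0) = 1
on_axis = attention_score(0.0, 0.3, axial)  # the two theta = 0 channels still give 1 each
```

The `on_axis` point sits on the vertical arm of the cross: the channels pointing along $x$ cannot see any displacement in $y$, so they keep contributing their full value there.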

If $S$ is the canvas, then the total positional energy of attention is its squared value, integrated over the canvas: $E = \iint_S \left(\sum_k \cos(z \cdot c_k)\right)^2 dx dy$, with $z = x + i y$. The distance-adjusted energy is $D = \iint_S \left(\sum_k \cos(z \cdot c_k) \|z\|\right)^2 dx dy$; it is the same thing multiplied by $\|z\|^2$. The ideal RoPE minimizes $D/E$, representing that attention is focused at the point $0$.
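The two integrals can be approximated numerically. The sketch below assumes a square canvas $[-w, w]^2$ and a plain midpoint Riemann sum; both choices, and the name `energy_ratio`, are mine rather than anything the visualizations are known to use.

```python
import math

def energy_ratio(thetas, freqs, half_width=1.0, steps=81):
    """Approximate (D/E, E, D) on the square [-half_width, half_width]^2."""
    terms = [(r * math.cos(t), r * math.sin(t)) for t, r in zip(thetas, freqs)]
    h = 2 * half_width / steps          # side length of one grid cell
    E = D = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h  # cell midpoint
        for j in range(steps):
            y = -half_width + (j + 0.5) * h
            score = sum(math.cos(x * cx + y * cy) for cx, cy in terms)
            E += score ** 2 * h * h
            D += score ** 2 * (x * x + y * y) * h * h  # weight by ||z||^2
    return D / E, E, D

# Hypothetical usage: six channels in an axial layout.
freqs = [2.0 ** k for k in range(6)]
axial_thetas = [0.0, math.pi / 2] * 3
ratio, E, D = energy_ratio(axial_thetas, freqs)
```

Calling the same function with a different list of angles gives a direct comparison of $D/E$ between layouts, within the accuracy of the grid.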

The frequencies $r_k$ are fixed, being used to control specificity and range.[1]

So we can only pick the angles $\theta_k$.

At $z=0$, the attention score is $\sum_k \cos(z \cdot c_k) = \sum_k \cos(0) = n$. This represents concentrated attention. Our goal is to minimize energy away from the center, to make $D$ small. In the absence of a cancellation trick, the next best tool is to make attention behave like noise - so that different $\cos$ terms do not line up except by chance.

At points away from the center, random-rotation RoPE evaluates each channel at a position uncorrelated with other RoPE channels. The average of uncorrelated values behaves like noise, scaling as $1/\sqrt n$, while the average value at the center remains $1$. So as $n$ increases, attention can better focus on the center. Compared to axial RoPE, the stray attention on the cross doesn’t move elsewhere - it is erased.
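The $1/\sqrt n$ scaling can be spelled out under an idealized assumption, which is mine and not derived here: at a fixed off-center $z$, treat the per-channel values $X_k = \cos(z \cdot c_k)$ as uncorrelated random variables with mean zero and variances $\sigma_k^2 \le 1$.

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{k=1}^{n} X_k\right)
  = \frac{1}{n^2}\sum_{k=1}^{n} \sigma_k^2
  \le \frac{1}{n},
\qquad\text{so}\qquad
\sqrt{\mathbb{E}\!\left[\left(\frac{1}{n}\sum_{k=1}^{n} X_k\right)^{2}\right]}
  \le \frac{1}{\sqrt n}.
```

If each phase $z \cdot c_k$ were uniform on $[0, 2\pi)$, then $\sigma_k^2 = 1/2$ and the typical off-center average would be $1/\sqrt{2n}$, against exactly $1$ at the center. The assumption is poor for low frequencies close to the center, where every cosine is still near $1$.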

For non-random rotations, the first guess is to set $\theta_k/\pi$ to multiples of the golden ratio. Two rotations overlapping would mean $a \varphi = b \varphi + c$ for integers $a, b, c$, leading to $c/(a-b) = \varphi$. By Hurwitz’s theorem, the golden ratio is the hardest number to approximate by rationals, so it’s good at avoiding overlaps. This produces uniform RoPE in Jerry’s blog post, where the incorrect multiple of 2 is my fault. fluffy renamed this construction Golden Gate RoPE, as it is unrelated to uniformity.
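A small sketch of this construction, assuming angles are compared modulo $\pi$ (since $c_k$ and $-c_k$ give the same cosine, which is also why the overlap condition above has an integer $c$ rather than $2c$). The helper names are hypothetical.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio

def golden_angles(n):
    """theta_k = pi * k * PHI for k = 0 .. n-1, reduced modulo pi."""
    return [(math.pi * k * PHI) % math.pi for k in range(n)]

def min_separation(angles):
    """Smallest gap between neighbouring angles on a circle of circumference pi."""
    ordered = sorted(angles)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    gaps.append(ordered[0] + math.pi - ordered[-1])  # wrap-around gap
    return min(gaps)

angles = golden_angles(8)
closest = min_separation(angles)  # how near the two most similar directions are
```

Swapping `PHI` for a number that is well approximated by a small fraction makes `closest` collapse quickly as `n` grows, which is the overlap the golden ratio is meant to avoid.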

But the golden ratio neglects that our frequencies are exponentially increasing - interactions between adjacent frequencies are the strongest ones. As the separation between frequencies increases, we expect the difference between rotation angles to transition incompletely from $\pi (-1 + \sqrt 5)/2$ toward $\pi / 2$, because perpendicular angles are the most uncorrelated. And the highest and lowest frequencies will have rotations that are only surrounded on one side, not two, so they will also need special logic.

These issues are fiddly and have local minima, so let’s pretend we have infinite channels.

Suppose the frequencies are exponentially distributed as $2 \pi s^i$. Minimize the energy $E$ inside the annulus $1/s < \|z\| < 1$. This captures a slice of RoPE frequencies; the total energy away from $0$ is a concatenation of such annuluses. Truncate the frequencies to $2\pi s^{-4}, 2\pi s^{-3}, ..., 2\pi s^{3}, 2\pi s^{4}$ for numerical convenience.
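A rough sketch of this annulus energy follows. It assumes that frequency $2\pi s^i$ is rotated to $\theta_i = i \cdot \text{angle}$, so one angle separates each adjacent pair; that is my reading of the setup, and the function name, grid sizes, and polar midpoint rule are likewise my own choices.

```python
import math

def annulus_energy(angle, s, half_range=4, radial_steps=40, angular_steps=180):
    """Approximate E over 1/s < |z| < 1 for frequencies 2*pi*s**i, i = -half_range..half_range."""
    terms = []
    for i in range(-half_range, half_range + 1):
        r = 2 * math.pi * s ** i
        theta = i * angle  # assumption: constant rotation step between adjacent frequencies
        terms.append((r * math.cos(theta), r * math.sin(theta)))
    inner, outer = 1.0 / s, 1.0
    dr = (outer - inner) / radial_steps
    dphi = 2 * math.pi / angular_steps
    total = 0.0
    for a in range(radial_steps):
        rho = inner + (a + 0.5) * dr          # midpoint radius
        for b in range(angular_steps):
            phi = (b + 0.5) * dphi            # midpoint polar angle
            x, y = rho * math.cos(phi), rho * math.sin(phi)
            score = sum(math.cos(x * cx + y * cy) for cx, cy in terms)
            total += score ** 2 * rho * dr * dphi  # rho is the polar area element
    return total

energy = annulus_energy(angle=1.94, s=2.0)
```

Sweeping `angle` over $[0, \pi)$ for a fixed `s` gives a crude curve to compare candidate angles, subject to the truncation and grid resolution.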

[Interactive visualization. Controls: angle, s, n (channels).]
When varying angle, lower energy and lower $D/E$ are better. Higher $n$ is more accurate, but causes lag. The golden ratio corresponds to angle 1.94.

[Interactive visualization by nshepperd. Controls: max frequency, num frequencies.]
$D/E$ spikes at rational multiples of $\pi$.

Here, $D/E$ is measured when $r_k = (\text{max frequency})^{k / (\text{num frequencies} - 1)}$.
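The formula for $r_k$ is a geometric ladder. The sketch below assumes $k$ runs from $0$ to $\text{num frequencies} - 1$, so that the smallest magnitude is $1$ and the largest is the max frequency; the index range and the function names are my assumptions.

```python
def frequency_magnitudes(max_frequency, num_frequencies):
    """r_k = max_frequency ** (k / (num_frequencies - 1)), assuming k = 0 .. num_frequencies - 1."""
    if num_frequencies < 2:
        raise ValueError("need at least two frequencies")
    return [max_frequency ** (k / (num_frequencies - 1)) for k in range(num_frequencies)]

def adjacent_ratio(max_frequency, num_frequencies):
    """Constant ratio between neighbouring magnitudes."""
    return max_frequency ** (1 / (num_frequencies - 1))

r = frequency_magnitudes(16.0, 5)  # geometric ladder from 1 up to 16
ratio = adjacent_ratio(16.0, 5)
```

If I read the two parametrizations correctly, `ratio` plays the role of $s$ from the annulus setup, so raising the number of frequencies at a fixed max frequency pushes $s$ toward 1.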

Observations

At high $s$, broad regions of angles perform comparably. This is because the higher separation between frequencies causes them to interact less.

Random rotations are not as good as chosen ones, but are better than axial RoPE. Beware that the truncation to $n$ means it’s easy to overfit. The optimal angle trends toward $\pi/2$ with higher $s$, but it’s unclear if this visualization is accurate enough to count as evidence for this effect.

The closer $s$ is to 1, the more important it is to choose a good angle.

The failure of axial RoPE (angle $0$ or $\pi/2$) is caused by having high energy in a line - which fails to decay away from the origin. This is the penalty of correlated sine waves.

In practice, I don’t expect anyone to care enough to optimize more than the single angle. $\pi (-1 + \sqrt 5)/2$ will do fine for the lazy.
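For the lazy option, a minimal sketch: channel $k$ (in frequency order) points along $k$ times the golden angle, and its rotation phase at pixel $(x, y)$ is the dot product with $c_k$. The per-channel ordering and the name `channel_phases` are assumptions of mine; how the phases are then applied to query and key pairs is left to whatever RoPE implementation is in use.

```python
import math

GOLDEN_ANGLE = math.pi * (math.sqrt(5) - 1) / 2  # about 1.94 radians

def channel_phases(x, y, freqs, angle=GOLDEN_ANGLE):
    """Rotation phase z . c_k of each channel at pixel (x, y), with theta_k = k * angle."""
    return [r * (x * math.cos(k * angle) + y * math.sin(k * angle))
            for k, r in enumerate(freqs)]

# Hypothetical usage: 8 channels with frequencies doubling from 1.
freqs = [2.0 ** k for k in range(8)]
phases = channel_phases(3.0, 5.0, freqs)
```

Setting `angle=math.pi / 2` instead gives a layout whose directions cycle through the two axes, which is one way to compare against an axial-style pattern with the same frequencies.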

Credit

Claude Code did the visualizations. To discuss this article, I’m in the EleutherAI discord. Or, qkxgecetu at mozmail dot com will work for a month.

Jerry’s blog post cites prior works such as RoPE-Mixed, but despite the similarities in final method, his discovery does not descend from them. His random-rotation picture, in which incoherence can be recognized, deserves most of the credit, a la Rosalind Franklin.

nor wrote a third post, on N-dimensional RoPE.

Footnotes

  1. Partial RoPE (rotary_dim) originated in the pre-Tri Dao days when RoPE was costly, so people rotated only part of the head. Nowadays, partial RoPE is no longer a speedup, though FlashRotary still supports it. My suggestion is to change the low frequencies instead of using partial RoPE.