
RoPE: Rotary Position Embedding

2026-01-27

Basics#

Earlier approaches to position encoding fall into two families:

  1. Absolute position encoding
$$p_{i,2t} = \sin\left(i / 10000^{2t/d}\right)$$
  2. Relative position encoding
  • e.g. Transformer-XL
  • This only injects position information into the contextual representation, so it is still incompatible with the linear-attention form:
$$\mathrm{Attention}(Q,K,V)_m = \frac{\phi(\boldsymbol{q}_m)^\top \left(\sum_{n=1}^{N} \phi(\boldsymbol{k}_n) \boldsymbol{v}_n^\top \right)}{\phi(\boldsymbol{q}_m)^\top \left(\sum_{n=1}^{N} \phi(\boldsymbol{k}_n)\right)}$$

Transformer-XL's relative position encoding depends on both $m$ and $n$:

$$\boldsymbol{q}_m^\top \boldsymbol{k}_n + \boldsymbol{q}_m^\top \boldsymbol{r}_{m-n}$$

With linear attention, the sum over $n$ can no longer be precomputed:

$$\sum_{n=1}^{N} \phi(\boldsymbol{q}_m)^\top \phi(\boldsymbol{k}_n + \boldsymbol{r}_{m-n}) \boldsymbol{v}_n$$

so in the end the cost falls back to $O(N^2)$.
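To make the factorization point concrete, here is a minimal NumPy sketch (not from the original post; the feature map `phi` and all variable names are illustrative). The key/value summaries are computed once and reused by every query, which is exactly what a query-position-dependent term inside the keys would break:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = rng.normal(size=(3, N, d))

phi = lambda x: np.maximum(x, 0) + 1e-6      # toy positive feature map

# Shared summaries, computed once: sum_n phi(k_n) v_n^T and sum_n phi(k_n)
S = phi(K).T @ V                             # (d, d)
z = phi(K).sum(axis=0)                       # (d,)

# Every query reuses S and z, so the whole thing is O(N)
out_linear = (phi(Q) @ S) / (phi(Q) @ z)[:, None]

# Same values computed the slow way, query by query
out_slow = np.stack([
    (phi(K) @ phi(q)) @ V / (phi(K) @ phi(q)).sum() for q in Q
])
assert np.allclose(out_linear, out_slow)

# If the keys were phi(k_n + r_{m-n}), the summaries would depend on the
# query position m and could not be shared: the cost is back to O(N^2).
```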

Implementation#

Basic Definitions#

Input sequence:

$$\mathbb{S}_N = \{w_i\}_{i=1}^{N}$$

Token embedding sequence:

$$\mathbb{E}_N = \{\boldsymbol{x}_i\}_{i=1}^{N}, \quad \boldsymbol{x}_i \in \mathbb{R}^d$$

Self-Attention#

$$\boldsymbol{q}_m = f_q(\boldsymbol{x}_m, m)$$

$$\boldsymbol{k}_n = f_k(\boldsymbol{x}_n, n)$$

$$\boldsymbol{v}_n = f_v(\boldsymbol{x}_n, n)$$

$$a_{m,n} = \frac{\exp(\boldsymbol{q}_m^\top \boldsymbol{k}_n / \sqrt{d})}{\sum_{j=1}^{N} \exp(\boldsymbol{q}_m^\top \boldsymbol{k}_j / \sqrt{d})}$$

$$\boldsymbol{o}_m = \sum_{n=1}^{N} a_{m,n} \boldsymbol{v}_n$$
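As a reference point, a small NumPy sketch of the vanilla self-attention defined above; here $f_q$, $f_k$, $f_v$ are plain linear maps that ignore $m$ and $n$, and all names are illustrative rather than taken from the post:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Position-agnostic self-attention following the formulas above."""
    d = Wq.shape[0]
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T           # q_m, k_n, v_n
    scores = Q @ K.T / np.sqrt(d)                     # q_m^T k_n / sqrt(d)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                # a_{m,n}
    return A @ V                                      # o_m = sum_n a_{m,n} v_n

rng = np.random.default_rng(0)
N, d = 6, 8
X = rng.normal(size=(N, d))
Wq, Wk, Wv = rng.normal(size=(3, d, d))
print(self_attention(X, Wq, Wk, Wv).shape)            # (6, 8)
```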

RoPE's goal: make the inner product of the Query and Key carry only relative position information:

$$\langle f_q(\boldsymbol{x}_m, m), f_k(\boldsymbol{x}_n, n) \rangle = g(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n)$$

Derivation#

Two-Dimensional Case#

Using basic properties of complex numbers, one can derive:

$$f_q(\boldsymbol{x}_m, m) = (W_q \boldsymbol{x}_m) e^{im\theta}$$

$$f_k(\boldsymbol{x}_n, n) = (W_k \boldsymbol{x}_n) e^{in\theta}$$

$$g(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n) = \mathrm{Re}\left[(W_q \boldsymbol{x}_m)(W_k \boldsymbol{x}_n)^* e^{i(m-n)\theta}\right]$$

$$f_{\{q,k\}}(\boldsymbol{x}_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} W_{\{q,k\}}^{(11)} & W_{\{q,k\}}^{(12)} \\ W_{\{q,k\}}^{(21)} & W_{\{q,k\}}^{(22)} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix}$$
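A quick numerical check of the two-dimensional case, as a sketch (the vectors below stand in for $W_q\boldsymbol{x}_m$ and $W_k\boldsymbol{x}_n$): rotating by $m\theta$ and $n\theta$ and taking the inner product gives the same value for every pair with the same $m-n$, and it matches the complex-number expression for $g$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                       # single rotation frequency for d = 2
q = rng.normal(size=2)            # stands in for W_q x_m
k = rng.normal(size=2)            # stands in for W_k x_n

def rot(v, angle):
    """Rotate a 2D vector by the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

for m, n in [(3, 1), (7, 5), (10, 8)]:    # all pairs have m - n = 2
    # inner product of the rotated vectors ...
    lhs = rot(q, m * theta) @ rot(k, n * theta)
    # ... equals Re[(q as complex)(k as complex)* e^{i(m-n)theta}]
    qc, kc = complex(q[0], q[1]), complex(k[0], k[1])
    rhs = (qc * kc.conjugate() * np.exp(1j * (m - n) * theta)).real
    assert np.isclose(lhs, rhs)
    print(m, n, lhs)              # same value for every pair: depends only on m - n
```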

Generalization#

First split the $d$-dimensional space into $d/2$ two-dimensional subspaces:

$$f_{\{q,k\}}(\boldsymbol{x}_m, m) = \boldsymbol{R}_{\Theta,m}^d W_{\{q,k\}} \boldsymbol{x}_m$$

The rotation matrix:

$$\boldsymbol{R}_{\Theta,m}^d = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\ 0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2} \end{pmatrix}$$

with the predefined parameters:

$$\Theta = \{\theta_i = 10000^{-2(i-1)/d},\ i \in [1, 2, \ldots, d/2]\}$$

Applying it to attention:

$$\boldsymbol{q}_m^\top \boldsymbol{k}_n = (\boldsymbol{R}_{\Theta,m}^d W_q \boldsymbol{x}_m)^\top (\boldsymbol{R}_{\Theta,n}^d W_k \boldsymbol{x}_n) = \boldsymbol{x}_m^\top W_q^\top \boldsymbol{R}_{\Theta,n-m}^d W_k \boldsymbol{x}_n$$

$$\boldsymbol{R}_{\Theta,n-m}^d = (\boldsymbol{R}_{\Theta,m}^d)^\top \boldsymbol{R}_{\Theta,n}^d$$
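A sketch, under the definitions above, that builds the block-diagonal matrix $\boldsymbol{R}_{\Theta,m}^d$ and checks both identities numerically (the function name is mine):

```python
import numpy as np

def rope_matrix(m, d, base=10000.0):
    """Block-diagonal rotation matrix R^d_{Theta,m} from the definition above."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)    # theta_i, i = 1..d/2
    R = np.zeros((d, d))
    for i, t in enumerate(theta):
        c, s = np.cos(m * t), np.sin(m * t)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

d = 8
m, n = 5, 9
Rm, Rn = rope_matrix(m, d), rope_matrix(n, d)

# R_{Theta,n-m} = (R_{Theta,m})^T R_{Theta,n}
assert np.allclose(Rm.T @ Rn, rope_matrix(n - m, d))

# so (R_m q)^T (R_n k) only sees the relative offset n - m
rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, d))
lhs = (Rm @ q) @ (Rn @ k)
rhs = q @ rope_matrix(n - m, d) @ k
assert np.isclose(lhs, rhs)
```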

Performance Optimization#

Exploit the sparsity of the rotation matrix for an efficient element-wise implementation:

$$\boldsymbol{R}_{\Theta,m}^d \boldsymbol{x} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix} \otimes \begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \\ \cos m\theta_{d/2} \\ \cos m\theta_{d/2} \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix} \otimes \begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \\ \sin m\theta_{d/2} \\ \sin m\theta_{d/2} \end{pmatrix}$$
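A minimal sketch of this element-wise form, pairing dimensions $(1,2), (3,4), \ldots$ exactly as in the equation above (the function name and layout are illustrative; real implementations sometimes pair dimensions differently):

```python
import numpy as np

def apply_rope(x, m, base=10000.0):
    """Rotate the d-dim vector x at position m using the sparse form above."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)    # theta_1 .. theta_{d/2}
    angles = np.repeat(m * theta, 2)                  # [m*t1, m*t1, m*t2, m*t2, ...]
    x_rot = np.empty_like(x)
    x_rot[0::2] = -x[1::2]                            # (-x2, x1, -x4, x3, ...)
    x_rot[1::2] = x[0::2]
    return x * np.cos(angles) + x_rot * np.sin(angles)

rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = apply_rope(x, m=7)
# it is a rotation, so the norm is preserved (see the orthogonality note below)
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))
```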

Properties#

Long-term Decay#

$$(\boldsymbol{R}_{\Theta,m}^d W_q \boldsymbol{x}_m)^\top (\boldsymbol{R}_{\Theta,n}^d W_k \boldsymbol{x}_n) = \mathrm{Re}\left[\sum_{i=0}^{d/2-1} \boldsymbol{q}_{[2i:2i+1]} \boldsymbol{k}_{[2i:2i+1]}^* e^{i(m-n)\theta_i}\right]$$

As the relative distance grows, the inner product decays, which resembles how relevance behaves in natural language.
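As a rough, hand-rolled illustration (not a derivation from the post): set every pairwise product $\boldsymbol{q}_{[2i:2i+1]}\boldsymbol{k}_{[2i:2i+1]}^*$ to $1$, so the expression above reduces to $\mathrm{Re}\big[\sum_i e^{i(m-n)\theta_i}\big]$. Its normalized value shrinks as the relative distance grows, because the frequencies drift out of phase:

```python
import numpy as np

d = 128
theta = 10000.0 ** (-2.0 * np.arange(d // 2) / d)     # theta_i

for dist in [1, 10, 100, 1000]:                       # relative distance m - n
    val = np.exp(1j * dist * theta).sum().real / (d // 2)
    print(f"m - n = {dist:4d}  ->  {val:+.3f}")       # trends toward 0
```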

With Linear Attention#

Standard attention:

$$\mathrm{Attention}(Q,K,V)_m = \frac{\sum_{n=1}^{N} \mathrm{sim}(\boldsymbol{q}_m, \boldsymbol{k}_n) \boldsymbol{v}_n}{\sum_{n=1}^{N} \mathrm{sim}(\boldsymbol{q}_m, \boldsymbol{k}_n)}$$

Linear attention:

$$\mathrm{Attention}(Q,K,V)_m = \frac{\sum_{n=1}^{N} \phi(\boldsymbol{q}_m)^\top \varphi(\boldsymbol{k}_n) \boldsymbol{v}_n}{\sum_{n=1}^{N} \phi(\boldsymbol{q}_m)^\top \varphi(\boldsymbol{k}_n)}$$

RoPE preserves vector norms, so it can be combined with linear attention:

$$\mathrm{Attention}(Q,K,V)_m = \frac{\sum_{n=1}^{N} (\boldsymbol{R}_{\Theta,m}^d \phi(\boldsymbol{q}_m))^\top (\boldsymbol{R}_{\Theta,n}^d \varphi(\boldsymbol{k}_n)) \boldsymbol{v}_n}{\sum_{n=1}^{N} \phi(\boldsymbol{q}_m)^\top \varphi(\boldsymbol{k}_n)}$$
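A sketch of this combination in NumPy, assuming the same toy feature map `phi` as in the earlier sketch (`rope` below is my own helper, not code from the post). The numerator still factorizes into one shared $(d \times d)$ summary because the rotation is applied per position, while the denominator keeps the plain, unrotated features:

```python
import numpy as np

def rope(X, base=10000.0):
    """Apply R^d_{Theta,m} to row m of X, using the element-wise form."""
    N, d = X.shape
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angles = np.repeat(np.arange(N)[:, None] * theta[None, :], 2, axis=1)  # (N, d)
    X_rot = np.empty_like(X)
    X_rot[:, 0::2] = -X[:, 1::2]
    X_rot[:, 1::2] = X[:, 0::2]
    return X * np.cos(angles) + X_rot * np.sin(angles)

phi = lambda x: np.maximum(x, 0) + 1e-6       # toy positive feature map

rng = np.random.default_rng(0)
N, d = 8, 16
Q, K, V = rng.normal(size=(3, N, d))

# numerator rotates the features, denominator keeps the plain ones
S = rope(phi(K)).T @ V                        # one shared (d, d) summary
num = rope(phi(Q)) @ S
den = phi(Q) @ phi(K).sum(axis=0)             # (N,)
out = num / den[:, None]
print(out.shape)                              # (8, 16)
```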

Flexibility#

Positions are encoded multiplicatively, so there is no need to predefine a maximum sequence length, and extrapolation to longer sequences is naturally supported.

Orthogonality#

The rotation matrix is orthogonal, which keeps the encoding numerically stable.
