
Single-Issue RISC-V Out-of-Order Pipeline Processor

verified in simulation

RV32IM Single-Issue Out-of-Order Pipelined Processor

Single-issue Tomasulo RV32IM with reorder buffer, branch prediction, and split L1 cache hierarchy.

Role
Course project, UIUC ECE Computer Architecture
Dates
Sep 2024 – Dec 2024
  • CPU
  • RTL
  • Tomasulo
  • Cache
  • RISC-V

UIUC course project (Sep–Dec 2024). The core talking point for CPU design and computer architecture roles.


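One-Sentence Summary

Designed, implemented, and verified an RV32IM single-issue out-of-order processor: a Tomasulo-based five-stage pipeline with 4-way set-associative I- and D-caches and BTB+RAT branch prediction, delivering a ~18% IPC improvement and 88% branch prediction accuracy.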

Resume Description (Original Text)

Single-Issue RISC-V Out-of-Order Pipeline Processor Design | 09/2024 – 12/2024

Project Overview: Design, implement, and verify a RISC-V RV32IM single-issue out-of-order processor with a Tomasulo-based five-stage pipeline.

  • Achieve dynamic scheduling and data hazard resolution via Tomasulo-based register renaming and RVS.
  • Implement a Harvard architecture with 4-way set-associative I-Cache and D-Cache, featuring write-back + write-allocate policies and PLRU replacement.
  • Integrate branch prediction with BTB and RAT, achieving 88% accuracy and ~18% IPC improvement.
  • Introduce performance counters and apply the Top-Down analysis methodology to identify pipeline bottlenecks, optimizing load and branch latency impacts.

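Key Highlights

  • Complete Tomasulo implementation (register renaming + RVS + CDB)
  • Complete cache subsystem (4-way + WB/WA + PLRU)
  • Branch prediction (BTB + RAT, 88% accuracy, 18% IPC gain)
  • Performance analysis methodology (Top-Down analysis, the classic Intel method)
  • Covers the core of computer architecture (corresponds to CS 433 lectures 3-11)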

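Interview Priority

⭐⭐⭐⭐⭐: Lead project for CPU design, computer architecture, and AI accelerator roles. It demonstrates end-to-end ability from architecture down to RTL.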

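Progress

  • Overall architecture diagram (can draw it on a whiteboard from memory)
  • Detailed Tomasulo data flow (RAT / RVS / CDB)
  • Cache design details (PLRU algorithm, WB/WA timing)
  • How the BTB and RAT work together
  • Top-Down analysis process and data
  • 30 deep-dive Q&As
  • English STAR narratives in three lengths
  • Key decisions / trade-offs
  • Performance data (IPC, accuracy, benchmarks)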

Entry Points

  • 01_Technical Details
  • 02_Deep-Dive Q&A
  • 03_Key Decisions
  • 04_Metrics
  • 05_English Narratives

Key Decisions

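Decision 1: Tomasulo vs. Scoreboard

Chose Tomasulo

  • Advantage: register renaming removes WAR/WAW hazards outright, whereas a scoreboard can only stall on them
  • Cost: the RVS, RAT, free list, and physical register file are complex to implement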
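
As a minimal sketch of why renaming removes WAR/WAW hazards, the Python model below maps architectural registers to physical ones through a RAT and a free list. The class name, the register counts, and the identity initial mapping are illustrative assumptions, not the project's RTL; special handling of x0 and returning registers to the free list at commit are left out.

```python
class Renamer:
    """Toy register renamer: RAT + free list (all names are illustrative)."""

    def __init__(self, num_arch=32, num_phys=64):
        # Assumption: architectural register xi starts mapped to physical pi.
        self.rat = list(range(num_arch))
        self.free_list = list(range(num_arch, num_phys))

    def rename(self, rd, rs1, rs2):
        # Sources read the current mapping, so true (RAW) dependences survive.
        ps1, ps2 = self.rat[rs1], self.rat[rs2]
        # The destination always gets a fresh physical register, so an older
        # reader (WAR) or older writer (WAW) of rd is never disturbed.
        pd = self.free_list.pop(0)
        old_pd = self.rat[rd]   # would be freed when this instruction commits
        self.rat[rd] = pd
        return pd, ps1, ps2, old_pd


r = Renamer()
i1 = r.rename(rd=1, rs1=2, rs2=3)   # add x1, x2, x3
i2 = r.rename(rd=4, rs1=1, rs2=5)   # add x4, x1, x5  (RAW on x1)
i3 = r.rename(rd=1, rs1=6, rs2=7)   # add x1, x6, x7  (WAW with i1, WAR with i2)
print(i1, i2, i3)
```

The third instruction writes a different physical register than the first, so it can complete in any order relative to the first two without corrupting the value the second one reads.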

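Decision 2: Single Issue vs. Multiple Issue

Chose single issue

  • Reason: course time was limited, so single issue got a working baseline up first
  • Impact: no superscalar ILP, but out-of-order execution still extracts ILP at single-issue width by letting independent instructions execute past a stalled one
  • If redoing it: go dual-issue, which means doubling the issue width, adding read ports to the RVS, and widening the CDB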

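Decision 3: Cache Write Policy - WB + WA

Chose write-back + write-allocate

  • Reason: less memory traffic (WB) plus reuse through locality (WA)
  • Cost: dirty bits, a writeback buffer, and higher coherence complexity
  • Alternatives: WT + no-WA (simple but bandwidth-hungry), WT + WA (coherence-friendly)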
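
The traffic argument can be illustrated with a toy model. The sketch below is a direct-mapped cache (a simplification; the project's caches are 4-way) that only counts line fills and dirty writebacks. The class name and sizes are assumptions chosen for the example.

```python
class WriteBackCache:
    """Direct-mapped toy cache; counts traffic to the next memory level."""

    def __init__(self, num_lines=4, line_bytes=16):
        self.num_lines, self.line_bytes = num_lines, line_bytes
        self.tags = [None] * num_lines
        self.dirty = [False] * num_lines
        self.mem_reads = 0    # line fills
        self.mem_writes = 0   # dirty-line writebacks

    def access(self, addr, is_write):
        line_addr = addr // self.line_bytes
        idx, tag = line_addr % self.num_lines, line_addr // self.num_lines
        hit = self.tags[idx] == tag
        if not hit:
            if self.dirty[idx]:       # evicting a dirty victim costs one write
                self.mem_writes += 1
            self.mem_reads += 1       # write-allocate: fill even on a store miss
            self.tags[idx], self.dirty[idx] = tag, False
        if is_write:
            self.dirty[idx] = True    # write-back: update only the cached line
        return hit


c = WriteBackCache()
for _ in range(10):
    c.access(0x100, is_write=True)          # ten stores to the same line
c.access(0x100 + 4 * 16, is_write=False)    # conflicting line forces the eviction
print(c.mem_reads, c.mem_writes)
```

Ten stores cost one fill and one writeback here; a write-through cache would have sent all ten stores to memory.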

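Decision 4: PLRU vs. True LRU

Chose PLRU

  • Reason: tree PLRU needs 3 bits per 4-way set; true LRU needs at least 5 bits (4! = 24 orderings) plus more complex update logic
  • Cost: roughly 1-2% lower hit rate
  • In practice: most commercial caches use PLRU or something even simpler, such as random or FIFO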
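
For reference, a 3-bit tree PLRU for one 4-way set can be sketched as follows. The bit encoding (which value of each bit points to which side) is one common convention, picked here as an assumption; the project's RTL may encode it differently.

```python
class TreePLRU4:
    """Tree pseudo-LRU for one 4-way set, using 3 bits.

    bits[0] picks the half to evict from (0 -> ways 0/1, 1 -> ways 2/3);
    bits[1] picks between ways 0/1, bits[2] between ways 2/3.
    """

    def __init__(self):
        self.bits = [0, 0, 0]

    def touch(self, way):
        # Point every bit on the path AWAY from the way that was just used.
        if way < 2:
            self.bits[0] = 1
            self.bits[1] = 1 - way          # way 0 -> 1, way 1 -> 0
        else:
            self.bits[0] = 0
            self.bits[2] = 1 - (way - 2)    # way 2 -> 1, way 3 -> 0

    def victim(self):
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]


p = TreePLRU4()
for w in (0, 1, 2, 3):
    p.touch(w)
v_same = p.victim()    # agrees with true LRU here: way 0
p.touch(0)
v_diff = p.victim()    # true LRU would evict way 1; tree PLRU picks way 2
print(v_same, v_diff)
```

The second case shows where the approximation costs hit rate: after the access order 0, 1, 2, 3, 0 the tree evicts way 2 even though way 1 is the least recently used.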

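Decision 5: Branch Predictor Complexity - Simple BTB, No TAGE

Chose BTB + (RAT/RAS)

  • Reason: scope control; getting the baseline working came first
  • Cost: 88% accuracy, whereas gshare/TAGE can reach 95%+
  • Upgrade path: add a GHR and a table of 2-bit counters to move up to gshare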
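
The gshare upgrade path can be sketched in a few lines. This is a generic textbook-style model, not the project's predictor; the table size, the weakly-not-taken initial counter value, and the helper `accuracy` are assumptions for illustration.

```python
class Gshare:
    """Direction predictor: PC XOR global history indexes 2-bit counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.ghr = 0                               # global history register
        self.counters = [1] * (1 << index_bits)    # 0..3, start weakly not-taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask  # drop the byte-offset bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


def accuracy(pred, pc, outcomes):
    """Predict, then train, on each outcome; return the fraction predicted correctly."""
    correct = 0
    for taken in outcomes:
        correct += pred.predict(pc) == taken
        pred.update(pc, taken)
    return correct / len(outcomes)


# A strictly alternating branch: hard for a lone 2-bit counter, easy with history.
acc = accuracy(Gshare(), pc=0x1000, outcomes=[True, False] * 100)
print(acc)
```

Once the history register has filled, the two alternating history patterns select different counters, so the predictor stops mispredicting this branch.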

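Decision 6: Precise Exceptions - Simplified or Full

Chose (to be confirmed)

  • Full version: the ROB retires in order and flushes all younger instructions on an exception
  • Simplified version: may skip part of the exception handling
  • (To fill in: which one your project actually implemented)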
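
Whichever option the project took, the "full version" behaviour can be sketched as below: instructions complete in any order, but only the ROB head may retire, and an excepting head discards everything younger. The entry fields and method names are hypothetical.

```python
from collections import deque


class ROB:
    """Toy reorder buffer: in-order retire, flush on an excepting head."""

    def __init__(self):
        self.entries = deque()   # oldest instruction on the left

    def allocate(self, name):
        entry = {"name": name, "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def retire(self):
        """Retire from the head only; returns (retired names, flushed names)."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            if head["exception"]:
                flushed = [head["name"]] + [e["name"] for e in self.entries]
                self.entries.clear()     # squash everything younger
                return retired, flushed
            retired.append(head["name"])
        return retired, []


rob = ROB()
i0, i1, i2 = rob.allocate("i0"), rob.allocate("i1"), rob.allocate("i2")
i2["done"] = True                        # youngest finishes first (out of order)
first = rob.retire()                     # nothing retires: head i0 is not done
i1["done"] = i1["exception"] = True
i0["done"] = True
second = rob.retire()                    # i0 retires, then i1 traps and i2 is dropped
print(first, second)
```

Although i2 finished executing first, its result never becomes architectural state, which is what makes the exception precise.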

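Decision 7: Top-Down Methodology

Chose Intel's Top-Down method rather than looking at IPC alone

  • Reason: IPC shows only the outcome; Top-Down shows why the core is slow
  • Cost: several extra counters plus a readout protocol to hook up
  • Value: locates bottlenecks systematically, so the 18% IPC gain is backed by data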

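Decision 8: Degree of Memory Disambiguation

(To be confirmed: how far your project takes this)

  • No speculation: a load waits until all older stores complete → simple but slow
  • Conservative speculation: a load waits until the addresses of older stores are known
  • Aggressive speculation: a load issues immediately, assuming no conflict, and is squashed if the guess was wrong
  • (To fill in: which one did you choose, and why?)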
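
The three policies differ only in the condition under which a load may issue. The sketch below states that condition as a predicate; the function name, the policy strings, and the exact-address comparison (which ignores partial overlaps and store-to-load forwarding) are simplifying assumptions.

```python
def load_can_issue(load_addr, older_stores, policy):
    """older_stores: (address or None if not yet computed, completed) per older store."""
    pending = [(addr, done) for addr, done in older_stores if not done]
    if policy == "no_speculation":
        # Wait until every older store has completed.
        return not pending
    if policy == "conservative":
        # Every pending store address must be known and must not alias the load.
        return all(addr is not None and addr != load_addr for addr, _ in pending)
    if policy == "aggressive":
        # Unknown addresses are guessed not to alias; a wrong guess is squashed later.
        return all(addr is None or addr != load_addr for addr, _ in pending)
    raise ValueError(f"unknown policy: {policy}")


stores = [(0x20, False), (None, False)]   # one known address, one still unknown
results = {p: load_can_issue(0x10, stores, p)
           for p in ("no_speculation", "conservative", "aggressive")}
print(results)
```

With one store address still unknown, only the aggressive policy lets the load go, which is exactly the case where it needs a squash-and-replay mechanism.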

Decision 9: RV32IM M Extension

MUL / DIV implementation

  • (To fill in: how many cycles? Booth-encoded? Restoring or non-restoring division?)
  • Trade-off: cycle count vs. area
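
To make the cycle-count side of the trade-off concrete, here is a sketch of unsigned restoring division that produces one quotient bit per iteration, so a 32-bit divide takes 32 steps with a single subtractor. It illustrates one of the options listed above and is not a statement of what the project implements; the function name is hypothetical.

```python
def restoring_divu(dividend, divisor, width=32):
    """Unsigned restoring division, one quotient bit per iteration (cycle)."""
    mask = (1 << width) - 1
    dividend &= mask
    divisor &= mask
    remainder, quotient = 0, 0
    for i in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        trial = remainder - divisor
        if trial >= 0:          # subtraction fits: keep it, quotient bit = 1
            remainder = trial
            quotient |= 1 << i
        # Otherwise "restore": leave the remainder as it was, quotient bit = 0.
    return quotient, remainder


q, r = restoring_divu(100, 7)
qz, rz = restoring_divu(5, 0)
print(q, r, hex(qz), rz)
```

With a zero divisor the loop yields an all-ones quotient and a remainder equal to the dividend, which matches the results the RISC-V M extension specifies for DIVU and REMU by zero.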

TODO

  • Confirm what the project actually chose for each decision
  • Add concrete data for each trade-off (cycles / area / power)

Metrics

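Numbers Claimed on the Resume

Metric | Value | Notes
Branch prediction accuracy | 88% | benchmark to be confirmed
IPC improvement | ~18% | vs. baseline, to be confirmed
Cache associativity | 4-way | one each for I-cache and D-cache
Pipeline depth | 5 stages | IF/ID/IS/EX/WB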

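Follow-ups You Must Be Able to Answer

88% accuracy

  • Which benchmark was it measured on?
    • SPEC2017 mini? CoreMark? A custom microbenchmark?
  • Compared against which baseline? (static taken / static not-taken / the previous version?)
  • How large was the sample? (instruction count / branch count)

18% IPC improvement

  • Versus which baseline?
    • An in-order pipeline?
    • The OoO core without a BTB?
    • The OoO core without caches?
  • Which benchmark?
  • Which specific optimizations? (branch prediction? cache? both?)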

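Detailed Data Still to Be Filled In

Performance data

  • Average IPC:
  • IPC per benchmark:
  • Cache hit rate (I-cache / D-cache):
  • BTB hit rate:
  • Average misprediction penalty (cycles):

Resource data (if synthesized)

  • Synthesis process node:
  • Frequency:
  • Area:
  • Area share per module:

Design size

  • Total RTL lines:
  • Number of modules:
  • Number of test cases: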

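Top-Down Data (fill in as needed)

Category | Share | Description
Frontend Bound | (to fill) % | I-cache miss / BTB miss
Bad Speculation | (to fill) % | slots wasted on branch mispredictions
Backend Bound | (to fill) % | execution / memory stalls
Retiring | (to fill) % | useful work completed (maximize this)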
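
Once the counters exist, filling in the table is simple arithmetic over issue slots. The sketch below assumes four hypothetical counters (total cycles, retired instructions, slots lost to fetch bubbles, slots spent on squashed work) and charges whatever remains to the backend; the counter values are made up purely to exercise the calculation.

```python
def top_down_breakdown(cycles, retired, frontend_bubbles, squashed_slots, width=1):
    """Split issue slots into the four Top-Down categories (as fractions)."""
    slots = cycles * width                 # single-issue: one slot per cycle
    result = {
        "frontend_bound": frontend_bubbles / slots,
        "bad_speculation": squashed_slots / slots,
        "retiring": retired / slots,
    }
    # Whatever is left is charged to the backend (execution / memory stalls).
    result["backend_bound"] = 1.0 - sum(result.values())
    return result


# Made-up counter values, not measurements from the project.
b = top_down_breakdown(cycles=1000, retired=550, frontend_bubbles=150, squashed_slots=100)
ipc = 550 / 1000
print(b, ipc)
```

For a single-issue core the Retiring fraction equals IPC, so any slot moved out of the other three categories shows up directly as an IPC gain.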

Template for Quantified Phrasing (substitute measured numbers; a single-issue core cannot exceed 1.0 IPC)

✅ “On the CoreMark benchmark, the baseline static-taken predictor achieved 70% accuracy with 0.50 IPC; my BTB+RAT design pushed accuracy to 88% and IPC to 0.59, an 18% improvement.”

✅ “Top-Down analysis revealed 30% Bad Speculation in the baseline; after BTB tuning it dropped to 12%, contributing most of the IPC gain.”

❌ “The accuracy is pretty good, around 88%.” (vague, and gives no reference point)

TODO

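  • Find the original test data (rerun the simulation to get real numbers)
  • Write down the baseline and benchmark behind the 88% / 18% figures
  • Fill in real Top-Down percentages for each category
  • Compare against commercial / open-source RISC-V cores (BOOM / Rocket / SiFive)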

STAR Narratives (English)

Key Terminology (Chinese/English)

Chinese | English
乱序执行 | Out-of-order Execution (OoO)
寄存器重命名 | Register Renaming
物理寄存器 | Physical Register
架构寄存器 | Architectural Register
保留站 | Reservation Station
重排序缓冲器 | Reorder Buffer (ROB)
公共数据总线 | Common Data Bus (CDB)
数据冒险 | Data Hazard (RAW / WAR / WAW)
分支预测 | Branch Prediction
分支目标缓存 | Branch Target Buffer (BTB)
错误预测 | Misprediction
流水线冲刷 | Pipeline Squash / Flush
内存歧义消解 | Memory Disambiguation
写回 / 写分配 | Write-back / Write-allocate
替换算法 | Replacement Policy (PLRU)
性能计数器 | Performance Counter
自顶向下分析 | Top-Down Analysis
前端 / 后端瓶颈 | Frontend Bound / Backend Bound
推测执行 | Speculative Execution
精确异常 | Precise Exception

30-Second Elevator Pitch

For my computer architecture course at UIUC, I designed and implemented a single-issue
out-of-order RISC-V processor supporting RV32IM. The core uses Tomasulo-based register
renaming with a five-stage pipeline, integrates a 4-way set-associative I-cache and
D-cache with write-back and write-allocate policies, and includes branch prediction with
a BTB. I also added performance counters and applied Top-Down analysis to identify
bottlenecks, which led to about 18% IPC improvement and 88% branch prediction accuracy.

2-Minute Standard Version(STAR)

Situation

This was my computer architecture project at UIUC, where I had a full semester to
build a working out-of-order RISC-V processor from scratch in SystemVerilog. The
goal was not just to make it work, but to apply real architecture techniques like
register renaming and Top-Down analysis.

Task

I designed an RV32IM single-issue out-of-order core with a five-stage pipeline using
the Tomasulo algorithm, plus a complete cache subsystem and branch prediction.

Action

I implemented register renaming with a Register Alias Table (RAT), a free list, and a
physical register file. The Reservation Station holds in-flight instructions and snoops the
Common Data Bus to wake up dependent ops. For caches, I built 4-way set-associative
I- and D-caches with write-back, write-allocate, and PLRU replacement. For branch
prediction, I integrated a BTB plus return-address handling. The most analytical part
was applying Intel's Top-Down methodology — I added performance counters to classify
each cycle as Frontend Bound, Bad Speculation, Backend Bound, or Retiring.

Result

The Top-Down analysis pointed me to load latency and branch misprediction as the two
biggest bottlenecks. After tuning the BTB and the load handling, IPC improved by
about 18% over the baseline, with branch prediction accuracy reaching 88%.

15-30 Minute Deep-Dive Version

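Draw the architecture on a whiteboard from memory and explain it in detail. There should be enough material to fill the full 30 minutes.

Outline

  1. Project Overview (2 min): RV32IM, OoO motivation
  2. Pipeline & Tomasulo (8 min):
    • Draw the block diagram on the whiteboard
    • Walk through add x1, x2, x3 end to end
    • How the RAT / RVS / CDB / physical register file cooperate
  3. Cache Subsystem (5 min):
    • 4-way + WB + WA + PLRU
    • PLRU algorithm demo (draw the binary tree)
    • hit/miss timing
  4. Branch Prediction (4 min):
    • How the BTB works
    • misprediction recovery
  5. Top-Down Analysis (6 min):
    • Definitions of the four categories
    • The counters you added
    • How Top-Down led you to the bottlenecks
    • Before/after optimization data
  6. Lessons Learned (3 min):
    • Hard parts (memory disambiguation? a bug story?)
    • What you would change if redoing it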

Frequently Asked Questions (English)

Q: Walk me through the renaming flow for add x1, x2, x3.

A: (To fill in: the complete flow, in English)

Q: How does your design handle a branch misprediction?

A: (To fill in: the recovery flow, in English)

Q: What’s the trade-off between PLRU and true LRU?

A: (To fill in)

Q: What was your baseline when you measured the 18% IPC improvement?

A: (To fill in: this follow-up is guaranteed to come up)

Q: Did you implement memory disambiguation? Why or why not?

A: (To fill in)

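Recording / Mock Practice Checklist

  • Record the 30-second pitch ≥ 5 times, with no filler words
  • Record the 2-minute STAR ≥ 5 times
  • Draw the architecture diagram from memory while narrating (in English) ≥ 3 times
  • Run the full 30-minute deep dive once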