Single-Issue RISC-V Out-of-Order Pipeline Processor
Verified in simulation
RV32IM single-issue out-of-order pipelined processor
Single-issue Tomasulo RV32IM with reorder buffer, branch prediction, and split L1 cache hierarchy.
- Role
- Course project, UIUC ECE Computer Architecture
- Dates
- Sep 2024 – Dec 2024
- CPU
- RTL
- Tomasulo
- Cache
- RISC-V
UIUC course project (Sep–Dec 2024). The flagship project to present when interviewing for CPU design and computer architecture roles.
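One-sentence summary
Designed, implemented, and verified an RV32IM single-issue out-of-order processor: a Tomasulo-based five-stage pipeline with 4-way set-associative I-cache and D-cache and BTB + RAT branch prediction, achieving ~18% IPC improvement and 88% branch prediction accuracy.
Resume description (original text)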
Single-Issue RISC-V Out-of-Order Pipeline Processor Design | 09/2024 – 12/2024
Project Overview: Design, implement, and verify a RISC-V RV32IM single-issue out-of-order processor with a Tomasulo-based five-stage pipeline.
- Achieve dynamic scheduling and data hazard resolution via Tomasulo-based register renaming and RVS.
- Implement a Harvard architecture with 4-way set-associative I-Cache and D-Cache, featuring write-back + write-allocate policies and PLRU replacement.
- Integrate branch prediction with BTB and RAT, achieving 88% accuracy and ~18% IPC improvement.
- Introduce performance counters and apply the Top-Down analysis methodology to identify pipeline bottlenecks, optimizing load and branch latency impacts.
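Key highlights
- ✅ Complete Tomasulo implementation (register renaming + RVS + CDB)
- ✅ Complete cache subsystem (4-way + WB/WA + PLRU)
- ✅ Branch prediction (BTB + RAT, 88% accuracy, ~18% IPC gain)
- ✅ Performance analysis methodology (Top-Down analysis, the classic Intel method)
- ✅ Covers the core of computer architecture (corresponds to CS 433 lectures 3-11)
Interview priority
⭐⭐⭐⭐⭐ Lead with this project for CPU design, computer architecture, and AI accelerator roles. It demonstrates end-to-end ability from architecture to RTL.
Progress
- Overall architecture diagram (can redraw from memory on a whiteboard)
- Detailed Tomasulo data flow (RAT / RVS / CDB)
- Cache design details (PLRU algorithm, WB/WA timing)
- How the BTB and RAT work together
- Top-Down analysis process and data
- 30 deep-dive Q&As
- English STAR narratives at three lengths
- Key decisions / trade-offs
- Performance data (IPC, accuracy, benchmarks)
Sections
- 01_Technical details
- 02_Deep-dive Q&A
- 03_Key decisions
- 04_Metrics
- 05_English narratives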
Key Decisions
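Decision 1: Tomasulo vs. scoreboard
Chose Tomasulo
- Advantage: register renaming removes WAR/WAW hazards outright, whereas a scoreboard can only stall on them
- Cost: the RVS, RAT, free list, and physical register file are complex to implement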
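The WAR/WAW claim can be illustrated with a small behavioral model in Python (not RTL). This is a minimal sketch, not the project's implementation: the `Renamer` class and the sizes (32 architectural, 64 physical registers) are illustrative assumptions, and returning physical registers to the free list at commit is omitted.

```python
from collections import deque


class Renamer:
    """Toy register renamer: a RAT plus a free list (sizes are illustrative)."""

    def __init__(self, num_arch=32, num_phys=64):
        # Initially architectural register xN maps to physical register pN.
        self.rat = list(range(num_arch))
        self.free_list = deque(range(num_arch, num_phys))

    def rename(self, rd, rs1, rs2):
        # Read the sources before updating the RAT, so that an instruction
        # such as add x1, x1, x2 still sees the old mapping of x1.
        ps1, ps2 = self.rat[rs1], self.rat[rs2]
        if rd == 0:
            return None, ps1, ps2      # x0 is hard-wired to zero: no allocation
        # popleft() raises IndexError when empty; real hardware would stall.
        pd = self.free_list.popleft()
        self.rat[rd] = pd
        return pd, ps1, ps2


r = Renamer()
i1 = r.rename(rd=1, rs1=2, rs2=3)  # add x1, x2, x3
i2 = r.rename(rd=4, rs1=1, rs2=5)  # add x4, x1, x5  (RAW on x1)
i3 = r.rename(rd=1, rs1=6, rs2=7)  # add x1, x6, x7  (WAW with i1, WAR with i2)
print(i1, i2, i3)  # (32, 2, 3) (33, 32, 5) (34, 6, 7)
```

After renaming, i3 writes p34 while i2 still reads p32, so i3 can complete before i2 without corrupting it. Only the true (RAW) dependence from i1 to i2 survives.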
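Decision 2: Single-issue vs. multi-issue
Chose single-issue
- Rationale: course time was limited, so get a single-issue baseline working first
- Impact: cannot exploit superscalar ILP, but out-of-order execution still lets independent instructions proceed past stalled ones within a single-issue pipeline
- If redoing it: go dual-issue, doubling the issue width, adding RVS read ports, and widening the CDB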
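Decision 3: Cache write policy (WB + WA)
Chose write-back + write-allocate
- Rationale: less memory traffic (WB) plus reuse through locality (WA)
- Cost: extra complexity from dirty bits, a writeback buffer, and coherence
- Alternatives: WT + no-WA (simple but bandwidth-hungry), WT + WA (coherence-friendly)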
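The memory-traffic argument can be made concrete with a toy count of write transactions. The sketch below is an assumption-heavy model, not the project's cache: it tracks a single resident line, uses a hypothetical 16-byte line size, counts transactions rather than bytes (a write-back moves a whole line, a write-through moves one word), and ignores the line fill that write-allocate triggers on a miss.

```python
def memory_writes(stores, policy, line_bytes=16):
    """Count write transactions that reach memory for a list of store addresses.

    Toy model: the cache holds a single line, so touching a new line evicts
    the old one.
    """
    resident, dirty, writes = None, False, 0
    for addr in stores:
        if policy == "write-through":
            writes += 1                      # every store goes to memory
            continue
        # write-back + write-allocate
        line = addr // line_bytes
        if line != resident:
            if dirty:
                writes += 1                  # evicting a dirty line: one writeback
            resident, dirty = line, False    # allocate the line on a write miss
        dirty = True
    if policy != "write-through" and dirty:
        writes += 1                          # final flush of the last dirty line
    return writes


stores = [0x100, 0x104, 0x108, 0x10C, 0x200, 0x204]
print(memory_writes(stores, "write-through"))  # 6
print(memory_writes(stores, "write-back"))     # 2
```

Six stores that fall into two cache lines cost six memory writes under write-through but only two line writebacks under write-back, which is the locality effect the rationale refers to.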
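Decision 4: PLRU vs. true LRU
Chose PLRU
- Rationale: tree PLRU for 4 ways needs 3 bits per set; true LRU needs at least 5 bits (4! = 24 orderings) plus more complex update logic
- Cost: a hit-rate loss of roughly 1-2%
- In practice: the vast majority of commercial caches use PLRU or even simpler random / FIFO policies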
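A 3-bit tree PLRU for one 4-way set can be sketched as follows. This is a behavioral Python model under an assumed bit convention (each bit points toward the side to evict next); the project's RTL may encode the tree differently.

```python
class TreePLRU4:
    """Tree pseudo-LRU state for one 4-way set, using 3 bits."""

    def __init__(self):
        # b[0]: root, b[1]: ways 0/1, b[2]: ways 2/3.
        # Convention (an assumption): each bit points at the victim side.
        self.b = [0, 0, 0]

    def touch(self, way):
        """Record an access: flip the bits on the path to point away from `way`."""
        if way < 2:
            self.b[0] = 1              # next victim search goes to the right pair
            self.b[1] = 1 - way        # point at the other way in the left pair
        else:
            self.b[0] = 0              # next victim search goes to the left pair
            self.b[2] = 1 - (way - 2)  # point at the other way in the right pair

    def victim(self):
        """Follow the bits from the root to the way that should be replaced."""
        if self.b[0] == 0:
            return self.b[1]           # way 0 or way 1
        return 2 + self.b[2]           # way 2 or way 3


plru = TreePLRU4()
for way in [0, 1, 2, 3]:
    plru.touch(way)
print(plru.victim())  # 0, the same choice as true LRU

plru.touch(0)
print(plru.victim())  # 2, whereas true LRU would evict way 1
```

The second case shows where the approximation comes from: the tree only remembers which half and which way within each half was touched last, not the full access order.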
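Decision 5: Branch predictor complexity (a simple BTB, no TAGE)
Chose BTB + (RAT/RAS)
- Rationale: scope control; get the baseline working first
- Cost: 88% accuracy, whereas gshare/TAGE can reach 95%+
- Upgrade path: add a GHR and a table of 2-bit counters to move to gshare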
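The gshare upgrade path (a GHR XORed with the PC to index a table of 2-bit counters) can be sketched in a few lines. This is an illustrative Python model, not something the project implemented; the 10-bit index width and the branch address are arbitrary assumptions.

```python
class Gshare:
    """Minimal gshare predictor: PC XOR global history indexes 2-bit counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.ghr = 0                           # global history register
        self.table = [1] * (1 << index_bits)   # counters start weakly not-taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # 2 or 3 means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


# A branch that alternates taken / not-taken. A lone 2-bit counter cannot
# track this pattern, but history-based indexing separates the two cases.
bp, correct, total = Gshare(), 0, 1000
for n in range(total):
    taken = (n % 2 == 0)
    correct += (bp.predict(0x8000_0040) == taken)
    bp.update(0x8000_0040, taken)
accuracy = correct / total
print(f"accuracy on alternating branch: {accuracy:.3f}")
```

After a short warm-up while the history register fills, the two history patterns map to two separate counters and the branch is predicted correctly from then on.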
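Decision 6: Precise exceptions (simplified or full)
Chose (to be confirmed)
- Full version: the ROB retires in order and flushes all younger instructions on an exception
- Simplified version: may skip some exception handling
- (To fill in: which one the project actually chose)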
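Decision 7: Top-Down methodology
Chose Intel's Top-Down method rather than looking at IPC alone
- Rationale: IPC only shows the outcome; Top-Down shows why the core is slow
- Cost: requires adding several counters and a readout protocol
- Value: locates bottlenecks systematically, so the 18% IPC improvement is backed by evidence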
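Decision 8: Degree of memory disambiguation
(To be confirmed: how far the project actually went)
- No speculation: a load waits until all older stores have completed; simple but slow
- Conservative speculation: a load waits until the addresses of older stores are known
- Aggressive speculation: a load issues immediately, predicting no conflict, and is squashed on a wrong guess
- (To fill in: which one did you choose, and why?)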
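The three options differ only in the condition under which a load is allowed to issue. The sketch below states that condition as a Python function. The `StoreEntry` type, the policy names, and the word-granularity address compare are hypothetical, and store-to-load forwarding is deliberately left out; it does not describe which option the project used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    addr: Optional[int]  # None until the store's address has been computed


def load_may_issue(load_addr: int, older_stores: List[StoreEntry], policy: str) -> bool:
    """Decide whether a load may issue, given the older stores still in flight."""
    if policy == "no-speculation":
        # Wait until every older store has completed and left the queue.
        return len(older_stores) == 0
    if policy == "conservative":
        # Wait until all older store addresses are known, then issue only if
        # none of them touches the same 32-bit word (no forwarding modeled).
        return all(
            s.addr is not None and (s.addr >> 2) != (load_addr >> 2)
            for s in older_stores
        )
    if policy == "aggressive":
        # Issue immediately; a later address match must squash and replay.
        return True
    raise ValueError(f"unknown policy: {policy}")


pending = [StoreEntry(addr=0x100), StoreEntry(addr=None)]
for policy in ("no-speculation", "conservative", "aggressive"):
    print(policy, load_may_issue(0x200, pending, policy))
```

With one store address still unknown, only the aggressive policy lets the load go, which is where its performance gain and its squash risk both come from.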
Decision 9: RV32IM M extension
MUL / DIV implementation
- (To fill in: how many cycles? Booth-encoded? Restoring or non-restoring division?)
- Trade-off: cycle count vs. area
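To make the cycle-count side of the trade-off concrete, here is a sketch of unsigned restoring division, one of the options named above. It is a reference model in Python, not the project's divider (that choice is still to be filled in). Each loop iteration corresponds to one cycle of a one-bit-per-cycle iterative divider, and the divide-by-zero result follows the RISC-V DIVU/REMU definition.

```python
def restoring_divu(dividend, divisor, bits=32):
    """Unsigned restoring division; returns (quotient, remainder)."""
    mask = (1 << bits) - 1
    if divisor == 0:
        # RISC-V: DIVU by zero returns all ones, REMU returns the dividend.
        return mask, dividend
    rem, quo = 0, 0
    for i in range(bits - 1, -1, -1):   # one iteration per quotient bit
        rem = (rem << 1) | ((dividend >> i) & 1)
        if rem >= divisor:              # trial subtraction succeeds
            rem -= divisor
            quo |= 1 << i
        # otherwise the partial remainder is "restored", i.e. left unchanged
    return quo, rem


print(restoring_divu(100, 7))  # (14, 2)
print(restoring_divu(5, 0))    # (4294967295, 5)
```

A 32-bit divide takes 32 iterations in this form. Retiring more quotient bits per cycle shortens that at the cost of more comparators and subtractors, which is the area side of the trade-off.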
TODO
- Confirm what the project actually chose for each decision
- Add concrete data for each trade-off (cycles / area / power)
Metrics
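Numbers claimed on the resume
| Metric | Value | Notes |
|---|---|---|
| Branch prediction accuracy | 88% | benchmark to be confirmed |
| IPC improvement | ~18% | baseline to be confirmed |
| Cache associativity | 4-way | one I-cache, one D-cache |
| Pipeline depth | 5 stages | IF/ID/IS/EX/WB |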
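Follow-ups you must be able to answer
88% accuracy
- Which benchmark was it measured on?
  - SPEC2017 mini? CoreMark? Custom microbenchmarks?
- Compared against what baseline? (static taken / static not-taken / the previous version?)
- How large was the sample? (instruction count / branch count)
18% IPC improvement
- Against what baseline?
  - An in-order pipeline?
  - The OoO core without a BTB?
  - The OoO core without caches?
- Which benchmark?
- Which specific optimizations? (branch prediction? cache? both?)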
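Detailed data still to add
Performance data
- Average IPC:
- IPC per benchmark:
- Cache hit rate (I-cache / D-cache):
- BTB hit rate:
- Average misprediction penalty (cycles):
Resource data (if synthesized)
- Process node:
- Frequency:
- Area:
- Area breakdown by module:
Design size
- Total RTL lines:
- Number of modules:
- Number of test cases: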
Top-Down data (fill in as needed)
| Category | Share | Notes |
|---|---|---|
| Frontend Bound | (to fill in) % | I-cache miss / BTB miss |
| Bad Speculation | (to fill in) % | work wasted on branch mispredictions |
| Backend Bound | (to fill in) % | execution / memory stalls |
| Retiring | (to fill in) % | useful work completed (maximize this) |
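The four shares can be derived from a handful of raw counters. The sketch below shows the arithmetic for a single-issue core, where every cycle offers exactly one issue slot. The counter names (`frontend_bubbles`, `squashed_slots`) are hypothetical, the input values are made up purely to show the calculation, and Backend Bound is taken as the remainder, as in the level-1 Top-Down breakdown.

```python
def top_down(cycles, retired, frontend_bubbles, squashed_slots):
    """Split issue slots into the four Top-Down categories (percent of slots)."""
    slots = cycles  # single-issue: exactly one issue slot per cycle
    backend = slots - retired - frontend_bubbles - squashed_slots
    counts = {
        "Frontend Bound": frontend_bubbles,   # slots with nothing delivered to issue
        "Bad Speculation": squashed_slots,    # slots spent on work later squashed
        "Backend Bound": backend,             # remainder: execution / memory stalls
        "Retiring": retired,                  # slots that retired an instruction
    }
    if slots <= 0 or min(counts.values()) < 0:
        raise ValueError("inconsistent counter values")
    return {name: 100.0 * n / slots for name, n in counts.items()}


# Made-up counter values, only to show the arithmetic.
breakdown = top_down(cycles=1000, retired=550, frontend_bubbles=100, squashed_slots=150)
for name, pct in breakdown.items():
    print(f"{name}: {pct:.1f}%")
```

Because the categories partition the slots, the four percentages always sum to 100, which is a useful sanity check on the counters themselves.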
Template for quantified statements
✅ “On the CoreMark benchmark, the baseline static-taken predictor achieved 70% accuracy at 0.50 IPC; my BTB+RAT design pushed accuracy to 88% and IPC to 0.59, an 18% improvement.”
✅ “Top-Down analysis revealed 30% Bad Speculation in baseline; after BTB tuning it dropped to 12%, contributing most of the IPC gain.”
❌ “The accuracy is pretty good, around 88%.” (vague, and gives no reference point)
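Quantified statements like these should be derived from raw simulation counters rather than quoted from memory. A minimal sketch of the arithmetic follows; the function names and counter values are illustrative, not measured data. Because a single-issue core retires at most one instruction per cycle, its IPC cannot exceed 1.0, so the example values stay below that.

```python
def branch_accuracy_pct(branches, mispredicts):
    """Prediction accuracy as a percentage of executed branches."""
    return 100.0 * (branches - mispredicts) / branches


def ipc(instructions, cycles):
    """Instructions retired per cycle."""
    return instructions / cycles


def ipc_gain_pct(base_cycles, new_cycles):
    """IPC improvement in percent, for two runs of the same instruction stream.

    With the instruction count fixed, new_ipc / base_ipc equals
    base_cycles / new_cycles.
    """
    return 100.0 * (base_cycles - new_cycles) / new_cycles


# Illustrative counter values only.
print(branch_accuracy_pct(10_000, 1_200))   # 88.0
print(ipc(590, 1_180), ipc(590, 1_000))     # 0.5 0.59
print(ipc_gain_pct(1_180, 1_000))           # 18.0
```

Stating the instruction count, cycle counts, and branch counts alongside the percentages also answers the sample-size follow-up directly.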
TODO
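- Locate the original test data (rerun the simulation to get real numbers)
- Spell out the baseline and benchmark behind the 88% / 18% figures
- Fill in real percentages for each Top-Down category
- Compare against commercial / open-source RISC-V cores (BOOM / Rocket / SiFive)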
STAR Narratives (English)
Key terminology (Chinese to English)
| Chinese | English |
|---|---|
| 乱序执行 | Out-of-order Execution(OoO) |
| 寄存器重命名 | Register Renaming |
| 物理寄存器 | Physical Register |
| 架构寄存器 | Architectural Register |
| 保留站 | Reservation Station |
| 重排序缓冲器 | Reorder Buffer(ROB) |
| 公共数据总线 | Common Data Bus(CDB) |
| 数据冒险 | Data Hazard(RAW / WAR / WAW) |
| 分支预测 | Branch Prediction |
| 分支目标缓存 | Branch Target Buffer(BTB) |
| 错误预测 | Misprediction |
| 流水线冲刷 | Pipeline Squash / Flush |
| 内存歧义消解 | Memory Disambiguation |
| 写回 / 写分配 | Write-back / Write-allocate |
| 替换算法 | Replacement Policy(PLRU) |
| 性能计数器 | Performance Counter |
| 自顶向下分析 | Top-Down Analysis |
| 前端 / 后端瓶颈 | Frontend Bound / Backend Bound |
| 推测执行 | Speculative Execution |
| 精确异常 | Precise Exception |
30-Second Elevator Pitch
For my computer architecture course at UIUC, I designed and implemented a single-issue
out-of-order RISC-V processor supporting RV32IM. The core uses Tomasulo-based register
renaming with a five-stage pipeline, integrates a 4-way set-associative I-cache and
D-cache with write-back and write-allocate policies, and includes branch prediction with
a BTB. I also added performance counters and applied Top-Down analysis to identify
bottlenecks, which led to about 18% IPC improvement and 88% branch prediction accuracy.
2-Minute Standard Version (STAR)
Situation
This was my computer architecture project at UIUC, where I had a full semester to
build a working out-of-order RISC-V processor from scratch in SystemVerilog. The
goal was not just to make it work, but to apply real architecture techniques like
register renaming and Top-Down analysis.
Task
I designed an RV32IM single-issue out-of-order core with a five-stage pipeline using
the Tomasulo algorithm, plus a complete cache subsystem and branch prediction.
Action
I implemented register renaming with a Register Alias Table, a free list, and a physical
register file. The Reservation Station holds in-flight instructions and snoops the
Common Data Bus to wake up dependent ops. For caches, I built 4-way set-associative
I- and D-caches with write-back, write-allocate, and PLRU replacement. For branch
prediction, I integrated a BTB plus return-address handling. The most analytical part
was applying Intel's Top-Down methodology — I added performance counters to classify
each cycle as Frontend Bound, Bad Speculation, Backend Bound, or Retiring.
Result
The Top-Down analysis pointed me to load latency and branch misprediction as the two
biggest bottlenecks. After tuning the BTB and the load handling, IPC improved by
about 18% over the baseline, with branch prediction accuracy reaching 88%.
15-30 Minute Deep-Dive Version
Redraw the diagram from memory on a whiteboard and explain it in detail. The goal is to still have material left at the 30-minute mark.
Outline
- Project Overview (2 min): RV32IM, OoO motivation
- Pipeline & Tomasulo (8 min):
  - Draw the block diagram on the whiteboard
  - Walk through the full flow of add x1, x2, x3
  - How the RAT / RVS / CDB / physical register file work together
- Cache Subsystem (5 min):
  - 4-way + WB + WA + PLRU
  - PLRU algorithm demo (draw the binary tree)
  - Hit/miss timing
- Branch Prediction (4 min):
  - How the BTB works
  - Misprediction recovery
- Top-Down Analysis (6 min):
  - Definitions of the four categories
  - The counters you added
  - How Top-Down led you to the bottlenecks
  - Before/after optimization data
- Lessons Learned (3 min):
  - Hard parts (memory disambiguation? a bug story?)
  - What you would change if redoing it
Frequently asked questions (English)
Q: Walk me through the renaming flow for add x1, x2, x3.
A: (To fill in: the complete flow, in English)
Q: How does your design handle a branch misprediction?
A: (To fill in: the recovery flow, in English)
Q: What’s the trade-off between PLRU and true LRU?
A: (To fill in)
Q: What was your baseline when you measured the 18% IPC improvement?
A: (To fill in; this follow-up is always asked)
Q: Did you implement memory disambiguation? Why or why not?
A: (To fill in)
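Recording / mock practice checklist
- Record the 30-second pitch at least 5 times, with no filler words
- Record the 2-minute STAR at least 5 times
- Redraw the architecture diagram from memory while narrating in English, at least 3 times
- Run one complete 30-minute deep dive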