第 3 章将源码文件转换为抽象语法树 - 学习笔记

本文为《Learn LLVM 17》第 3 章的学习笔记，主要介绍如何为 tinylang（Modula-2 子集）实现完整的编译器前端，从语法定义到 AST 构建。

1. 本章概览

编译器分为前端和后端。本章实现 tinylang 的前端，将源语言转换为抽象语法树 (AST)，为后续代码生成奠定基础。

1.1 学习目标

主题	内容
语言定义	tinylang 语法（Modula-2 子集）
项目结构	模块化目录布局
输入管理	多文件处理、SourceMgr
用户信息	集中式诊断消息系统
词法分析	模块化 Lexer、TokenKind、关键字过滤
语法分析	递归下降解析器、错误恢复
语义分析	作用域、AST 构建、LLVM 风格 RTTI

2. tinylang 语言定义

2.1 示例程序：欧几里得最大公约数

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


MODULE Gcd;

PROCEDURE GCD(a, b: INTEGER) : INTEGER;
VAR t: INTEGER;
BEGIN
  IF b = 0 THEN
    RETURN a;
  END;
  WHILE b # 0 DO
    t := a MOD b;
    a := b;
    b := t;
  END;
  RETURN a;
END GCD;

END Gcd.

2.2 完整语法规则

编译单元与块

1
2
3
4
5
6
7


compilationUnit
  : "MODULE" identifier ";" ( import )* block identifier "." ;

Import : ( "FROM" identifier )? "IMPORT" identList ";" ;

Block
  : ( declaration )* ( "BEGIN" statementSequence )? "END" ;

声明

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


declaration
  : "CONST" ( constantDeclaration ";" )*
  | "VAR" ( variableDeclaration ";" )*
  | procedureDeclaration ";" ;

constantDeclaration : identifier "=" expression ;

variableDeclaration : identList ":" qualident ;

qualident : identifier ( "." identifier )* ;
identList : identifier ( "," identifier)* ;

过程声明

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


procedureDeclaration
  : "PROCEDURE" identifier ( formalParameters )? ";"
    block identifier ;

formalParameters
  : "(" ( formalParameterList )? ")" ( ":" qualident )? ;

formalParameterList
  : formalParameter (";" formalParameter )* ;

formalParameter : ( "VAR" )? identList ":" qualident ;

语句

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


statementSequence
  : statement ( ";" statement )* ;

statement
  : qualident ( ":=" expression | ( "(" ( expList )? ")" )? )
  | ifStatement | whileStatement | "RETURN" ( expression )? ;

ifStatement
  : "IF" expression "THEN" statementSequence
    ( "ELSE" statementSequence )? "END" ;

whileStatement
  : "WHILE" expression "DO" statementSequence "END" ;

表达式

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


expList : expression ( "," expression )* ;

expression : simpleExpression ( relation simpleExpression )? ;

relation : "=" | "#" | "<" | "<=" | ">" | ">=" ;

simpleExpression : ( "+" | "-" )? term ( addOperator term )* ;
addOperator : "+" | "-" | "OR" ;

term : factor ( mulOperator factor )* ;
mulOperator : "*" | "/" | "DIV" | "MOD" | "AND" ;

factor
  : integer_literal | "(" expression ")" | "NOT" factor
  | qualident ( "(" ( expList )? ")" )? ;

词法规则

标识符：字母或下划线开头，后跟字母、数字、下划线
整数字面量：十进制数字序列，或后跟 H 的十六进制序列

3. 项目目录结构

3.1 组件布局

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


tinylang/
├── include/tinylang/
│   ├── Basic/          # 基础组件：TokenKind、Diagnostic、LLVM.h
│   ├── Lexer/          # 词法分析器
│   ├── Parser/         # 语法分析器
│   ├── AST/            # 抽象语法树
│   └── Sema/           # 语义分析器
├── lib/
│   ├── Basic/
│   ├── Lexer/
│   ├── Parser/
│   ├── AST/
│   └── Sema/
└── tools/driver/       # 驱动程序

3.2 依赖关系

1
2
3


Basic ← Lexer
Basic ← AST ← Sema
Basic ← Parser ← Lexer, AST, Sema

Lexer 仅依赖 Basic
Parser 依赖 Basic、Lexer、AST、Sema
Sema 依赖 Basic、AST

4. 管理编译器输入文件

4.1 LLVM SourceMgr

方法	用途
`AddNewSourceBuffer()`	添加新源文件缓冲区
`AddIncludeFile()`	加载文件

返回 SourceBufferID 标识缓冲区
llvm::SMLoc 封装源位置

4.2 错误输出

PrintMessage() 向用户输出错误/警告
支持行号、列号标记

5. 处理用户信息（Diagnostic 系统）

5.1 设计原则

消息集中定义，便于修改和翻译
三元组：ID、严重级别、消息文本

5.2 Diagnostic.def 格式

1
2
3
4
5
6
7
8
9


#ifndef DIAG
#define DIAG(ID, Level, Msg)
#endif

DIAG(err_symbold_declared, Error, "symbol {0} already declared")
DIAG(err_vardecl_requires_type, Error, "variable declaration requires type")
// ...

#undef DIAG

{0}、{1} 为占位符，支持 formatv() 可变参数

5.3 DiagnosticsEngine 类

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


// 头文件：枚举生成
namespace diag {
enum {
#define DIAG(ID, Level, Msg) ID,
#include "tinylang/Basic/Diagnostic.def"
};
}

// 核心方法
template <typename... Args>
void report(SMLoc Loc, unsigned DiagID, Args &&... Arguments);

使用 llvm::formatv() 格式化消息
统计 NumErrors

5.4 实现文件中的宏使用

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


// 消息文本数组
const char *DiagnosticText[] = {
#define DIAG(ID, Level, Msg) Msg,
#include "tinylang/Basic/Diagnostic.def"
};

// 严重级别数组
SourceMgr::DiagKind DiagnosticKind[] = {
#define DIAG(ID, Level, Msg) SourceMgr::DK_##Level,
#include "tinylang/Basic/Diagnostic.def"
};

6. 词法分析器 (Lexer)

6.1 TokenKind 定义

TokenKinds.def 三类标记：

宏	用途	示例
`TOK(ID)`	通用标记	`unknown`, `eof`, `identifier`
`PUNCTUATOR(ID, SP)`	标点符号	`PUNCTUATOR(plus, "+")`
`KEYWORD(ID, FLAG)`	关键字	`KEYWORD(BEGIN, KEYALL)`

关键字前缀 kw_，如 kw_BEGIN。

6.2 枚举生成

1
2
3
4
5
6
7


namespace tok {
enum TokenKind : unsigned short {
#define TOK(ID) ID,
#include "tinylang/Basic/TokenKinds.def"
  NUM_TOKENS
};
}

6.3 辅助函数

getTokenName()：标记名称
getPunctuatorSpelling()：标点拼写
getKeywordSpelling()：关键字拼写

6.4 关键字过滤器

1
2
3
4
5
6


class KeywordFilter {
  llvm::StringMap<tok::TokenKind> HashTable;
public:
  void addKeywords();  // 从 .def 填充
  tok::TokenKind getKeyword(StringRef Name, tok::TokenKind DefaultTokenCode);
};

标识符与关键字同属一类，先匹配标识符再查表判断是否为关键字
KEYALL 标志支持语言级别过滤

6.5 多字符操作符

1
2
3
4
5
6


case '<':
  if (*(CurPtr + 1) == '=')
    formTokenWithChars(token, CurPtr + 2, tok::lessequal);
  else
    formTokenWithChars(token, CurPtr + 1, tok::less);
  break;

7. 递归下降语法分析器

7.1 语法规则到 C++ 的映射

语法结构	C++ 实现
非终结符	调用对应 `parseXXX()`
终结符	`consume(tok::xxx)`
可选 `(A)?`	`if (Tok.is(tok::xxx)) { advance(); }`
重复 `(A)*`	`while (...) { parseA(); }`
选择 `A \| B`	`if/switch` 检查超前标记

7.2 示例：parseIfStatement

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


void Parser::parseIfStatement() {
  consume(tok::kw_IF);
  parseExpression();
  consume(tok::kw_THEN);
  parseStatementSequence();
  if (Tok.is(tok::kw_ELSE)) {
    advance();
    parseStatementSequence();
  }
  consume(tok::kw_END);
}

7.3 解析器类型

类型	展开顺序	名称
自顶向下	最左符号先展开	LL 解析器
自底向上	最右符号先展开	LR 解析器

tinylang 使用递归下降，属于 LL 解析器。

7.4 左递归问题

错误示例：

1

expression : expression "+" term ;

会导致无限递归：

1
2
3
4
5


void parseExpression() {
  parseExpression();  // 无限递归！
  consume(tok::plus);
  parseTerm();
}

解决：改写语法消除左递归，或使用算法检测并消除。

注意：左递归只影响 LL 解析器；LR 解析器需避免右递归。

7.5 语法冲突

C# using 示例：

1

usingStmt : "using" (ident "=")? ident ";"  // 冲突：可选组与后续标识符重叠

解决策略：

重写语法：usingStmt : "using" ident ("=" ident)? ";"
谓词：peek(0) 等额外上下文
回溯：保存状态尝试，失败则恢复（效率低）

7.6 错误恢复：恐慌模式

为每个非终结符定义 FOLLOW 集合
出错时跳过标记直到遇到 FOLLOW 中的符号
skipUntil() 可跳过到指定标记集合

7.7 带错误恢复的 parseIfStatement

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


bool Parser::parseIfStatement() {
  auto _errorhandler = [this] {
    return skipUntil(tok::semi, tok::kw_ELSE, tok::kw_END);
  };
  if (consume(tok::kw_IF)) return _errorhandler();
  if (parseExpression(E)) return _errorhandler();
  if (consume(tok::kw_THEN)) return _errorhandler();
  if (parseStatementSequence(IfStmts)) return _errorhandler();
  if (Tok.is(tok::kw_ELSE)) {
    advance();
    if (parseStatementSequence(ElseStmts)) return _errorhandler();
  }
  if (expect(tok::kw_END)) return _errorhandler();
  return false;
}

返回 true 表示错误恢复未完成
Lambda 集中处理清理逻辑

7.8 解析器/词法分析器生成器

工具	用途	特点
flex + bison	词法 + LALR 解析	经典，语法广泛
ANTLR	词法 + LALR 解析 + AST	LL(*)，多语言
Coco/R	LL(1) 词法 + 递归下降	自包含，生成简单

8. 语义分析

8.1 语义分析器职责

检查名称重复声明
检查名称使用前已声明
计算表达式类型
检查类型兼容性（赋值、形参、IF/WHILE 条件）
判断表达式是否为常量

8.2 名称作用域

使用前声明规则：tinylang 采用 declare-before-use。

1
2
3
4
5
6
7


VAR B, X: INTEGER;

PROCEDURE Proc;
VAR B: BOOLEAN;  -- 局部 B 遮蔽全局 B
BEGIN
  (* 此处 B 指 BOOLEAN，X 指 INTEGER *)
END Proc;

8.3 Scope 实现

1
2
3
4
5
6
7


class Scope {
  Scope *Parent;
  StringMap<Decl *> Symbols;
public:
  bool insert(Decl *Declaration);   // 插入，失败返回 false
  Decl *lookup(StringRef Name);      // 沿父链查找
};

lookup 从当前作用域向父作用域递归查找

8.4 LLVM 风格 RTTI

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


class Decl {
public:
  enum DeclKind { DK_Module, DK_Const, DK_Type, DK_Var, DK_Param, DK_Proc };
private:
  const DeclKind Kind;
public:
  DeclKind getKind() const { return Kind; }
};

// 子类
static bool classof(const Decl *D) {
  return D->getKind() == DK_Var;
}

使用 llvm::isa<>、llvm::dyn_cast<> 类型检查与转换
避免 C++ RTTI 开销

8.5 AST 节点层次

声明 (Decl)：

ModuleDeclaration
ConstantDeclaration
TypeDeclaration
VariableDeclaration
FormalParameterDeclaration
ProcedureDeclaration

表达式 (Expr)：

InfixExpression、PrefixExpression
IntegerLiteral、BooleanLiteral
VariableAccess、ConstantAccess
FunctionCallExpr

语句 (Stmt)：

AssignmentStatement
ProcedureCallStatement
IfStatement、WhileStatement
ReturnStatement

8.6 变量声明语义动作

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


void Sema::actOnVariableDeclaration(DeclList &Decls, IdentList &Ids, Decl *D) {
  if (TypeDeclaration *Ty = dyn_cast<TypeDeclaration>(D)) {
    for (auto &[Loc, Name] : Ids) {
      auto *Decl = new VariableDeclaration(CurrentDecl, Loc, Name, Ty);
      if (CurrentScope->insert(Decl))
        Decls.push_back(Decl);
      else
        Diags.report(Loc, diag::err_symbold_declared, Name);
    }
  } else if (!Ids.empty()) {
    Diags.report(Loc, diag::err_vardecl_requires_type);
  }
}

8.7 作用域管理（RAII）

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


void Sema::enterScope(Decl *D) {
  CurrentScope = new Scope(CurrentScope);
  CurrentDecl = D;
}

void Sema::leaveScope() {
  Scope *Parent = CurrentScope->getParent();
  delete CurrentScope;
  CurrentScope = Parent;
  CurrentDecl = CurrentDecl->getEnclosingDecl();
}

// RAII 辅助类
class EnterDeclScope {
  Sema &Semantics;
public:
  EnterDeclScope(Sema &Semantics, Decl *D) : Semantics(Semantics) {
    Semantics.enterScope(D);
  }
  ~EnterDeclScope() { Semantics.leaveScope(); }
};

8.8 过程声明两阶段

解析阶段：解析名称后立即 actOnProcedureDeclaration() 创建 AST 节点并插入作用域
完成阶段：解析完声明和语句后调用 actOnProcedureDeclaration() 完成，设置参数、返回类型、声明列表、语句列表

8.9 全局作用域初始化

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


void Sema::initialize() {
  CurrentScope = new Scope();
  IntegerType = new TypeDeclaration(...);
  BooleanType = new TypeDeclaration(...);
  TrueLiteral = new BooleanLiteral(true, BooleanType);
  FalseLiteral = new BooleanLiteral(false, BooleanType);
  TrueConst = new ConstantDeclaration(..., "TRUE", TrueLiteral);
  FalseConst = new ConstantDeclaration(..., "FALSE", FalseLiteral);
  CurrentScope->insert(IntegerType);
  CurrentScope->insert(BooleanType);
  // ...
}

8.10 属性计算

继承属性：来自调用者（如作用域）
综合属性：由子节点计算（如类型）
tinylang 的 declare-before-use 允许在构造 AST 时完成大部分语义计算

9. 驱动程序

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


int main(int argc_, const char **argv_) {
  llvm::InitLLVM X(argc_, argv_);
  for (const char *F : argv) {
    auto FileOrErr = llvm::MemoryBuffer::getFile(F);
    // ...
    llvm::SourceMgr SrcMgr;
    DiagnosticsEngine Diags(SrcMgr);
    SrcMgr.AddNewSourceBuffer(std::move(*FileOrErr), llvm::SMLoc());
    auto TheLexer = Lexer(SrcMgr, Diags);
    auto TheSema = Sema(Diags);
    auto TheParser = Parser(TheLexer, TheSema);
    TheParser.parse();
  }
}

9.1 运行

1
2


$ tinylang Gcd.mod
Tinylang 0.1

正确程序无输出；可故意引入错误以触发诊断。

10. 总结

10.1 前端流程

1
2
3


源文件 → SourceMgr → Lexer → Token 流 → Parser → AST
                ↑                    ↑
           DiagnosticsEngine      Sema

10.2 关键技术点

组件	技术
Basic	.def 宏、Diagnostic 枚举、TokenKind
Lexer	手写 Lexer、KeywordFilter、多字符 Token
Parser	递归下降、恐慌模式、Lambda 错误处理
Sema	作用域链、LLVM RTTI、RAII 作用域
AST	声明/表达式/语句层次、classof

10.3 与 calc 的对比

方面	calc	tinylang
语法	简单表达式	完整 Modula-2 子集
结构	单文件	多组件、模块化
诊断	分散	集中 Diagnostic.def
作用域	StringSet	Scope 链 + 父指针
AST	访问者模式	LLVM RTTI + dyn_cast

10.4 后续

第 4 章将基于本章的 AST 完成 IR 代码生成 和汇编输出。

第 3 章 将源码文件转换为抽象语法树 - 学习笔记