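Chapter 8: In Practice - Building a RAG Evaluation Harness
RAG (retrieval-augmented generation) systems have evaluation needs that plain LLM applications do not. This chapter covers how to design a harness for them.
What Makes RAG Evaluation Different
Understanding the RAG Architecture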
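graph LR
    A[User query] --> B[Retriever]
    B --> C[Retrieved documents]
    C --> D[Reranking]
    D --> E[Prompt construction]
    E --> F[Generator]
    F --> G[Output]
    subgraph "Retrieval quality"
        B --> B1[Retrieval relevance]
        D --> D1[Ranking accuracy]
    end
    subgraph "Generation quality"
        E --> E1[Citation accuracy]
        F --> F1[Answer quality]
    end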
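RAG vs. Plain LLM Evaluation
| Dimension | Plain LLM | RAG system |
|---|---|---|
| Answer correctness | The answer is judged directly | Must be verified against the retrieved content |
| Context utilization | Not applicable | Checks whether the retrieved content is used correctly |
| Retrieval quality | Not applicable | A core evaluation dimension |
| Citation accuracy | Not applicable | Whether information sources are attributed correctly |
| Knowledge freshness | Limited to the model's built-in knowledge | The retrieval corpus can be updated |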
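RAG Evaluation Framework
The overall score is a weighted combination of the two stage scores:
$$S_{\text{RAG}} = \alpha \cdot S_{\text{retrieval}} + \beta \cdot S_{\text{generation}}$$
where:
- $S_{\text{retrieval}}$: retrieval quality score
- $S_{\text{generation}}$: generation quality score
- Suggested weights: $\alpha = 0.4$, $\beta = 0.6$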
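As a quick check on the arithmetic, the weighted combination of a retrieval score and a generation score can be sketched in a few lines. The function name `combine_scores` and its parameters are illustrative assumptions; the default weights follow the 0.4 / 0.6 split this chapter uses for retrieval and generation.

```python
def combine_scores(
    retrieval: float,
    generation: float,
    w_retrieval: float = 0.4,
    w_generation: float = 0.6,
) -> float:
    """Weighted sum of the two stage scores; the weights must sum to 1."""
    if abs(w_retrieval + w_generation - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_retrieval * retrieval + w_generation * generation


# Example: retrieval 0.82, generation 0.75
overall = combine_scores(0.82, 0.75)
print(round(overall, 3))  # 0.4 * 0.82 + 0.6 * 0.75 = 0.778
```

Keeping the weights as parameters makes it easy to shift emphasis toward retrieval while a retriever is still being tuned.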
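Retrieval Quality Evaluation
Retrieval Metrics
| Metric | Description | Calculation |
|---|---|---|
| Recall | Share of the relevant documents that were retrieved | $\frac{\lvert \text{relevant} \cap \text{retrieved} \rvert}{\lvert \text{relevant} \rvert}$ |
| Precision | Share of the retrieved documents that are relevant | $\frac{\lvert \text{relevant} \cap \text{retrieved} \rvert}{\lvert \text{retrieved} \rvert}$ |
| MRR | Position of the first relevant document | $\frac{1}{\lvert Q \rvert} \sum_{q \in Q} \frac{1}{\text{rank}_q}$ |
| NDCG | Overall ranking quality | See the formula below |
| Context Relevance | Relevance of the retrieved content to the query | Semantic similarity |
NDCG computation:
$$\text{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$$
Here $rel_i$ is the relevance of the document at rank $i$, and $\text{IDCG@}k$ is the DCG of the ideal ordering.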
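A small worked example may make the NDCG computation concrete. The sketch below assumes binary relevance (a document is either relevant or not); the helper names `dcg` and `ndcg_binary` are illustrative and not part of the chapter's codebase.

```python
import math
from typing import List, Set


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain: rel_i / log2(i + 1) with 1-based rank i."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg_binary(retrieved_ids: List[str], relevant_ids: Set[str]) -> float:
    """NDCG where a document's gain is 1 if it is relevant and 0 otherwise."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in retrieved_ids]
    # Ideal ordering: all relevant documents that fit are placed at the top.
    ideal_hits = min(len(relevant_ids), len(retrieved_ids))
    ideal_dcg = dcg([1.0] * ideal_hits)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevant documents appear at ranks 1 and 3 of the 3 retrieved.
# DCG = 1/log2(2) + 1/log2(4) = 1.5; IDCG = 1/log2(2) + 1/log2(3)
score = ndcg_binary(["d1", "d7", "d2"], {"d1", "d2"})
print(round(score, 4))  # 0.9197
```

Moving the second relevant document from rank 3 to rank 2 would bring the score to 1.0, which is the behavior that distinguishes NDCG from plain recall.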
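Retrieval Evaluation Implementation
# evaluators/retrieval_evaluator.py
import math
from typing import Dict, List, Optional

import numpy as np

# get_embedding() and cosine_similarity() are assumed to be provided elsewhere
# in the project (an embedding client and a vector-similarity helper).


class RetrievalEvaluator:
    """Evaluates RAG retrieval quality."""

    def evaluate(
        self,
        query: str,
        retrieved_docs: List[Dict],
        ground_truth_docs: Optional[List[str]] = None,
    ) -> Dict:
        """
        Evaluate retrieval quality.

        Args:
            query: The user query.
            retrieved_docs: Documents returned by the retriever, in rank order.
            ground_truth_docs: IDs of the documents that should be retrieved (if known).

        Returns:
            A dict of retrieval quality metrics.
        """
        results = {}

        # 1. With ground truth available, compute the classic retrieval metrics.
        if ground_truth_docs:
            retrieved_ids = [doc["id"] for doc in retrieved_docs]
            results["recall"] = self._calculate_recall(
                retrieved_ids, ground_truth_docs
            )
            results["precision"] = self._calculate_precision(
                retrieved_ids, ground_truth_docs
            )
            results["mrr"] = self._calculate_mrr(
                retrieved_ids, ground_truth_docs
            )
            results["ndcg"] = self._calculate_ndcg(
                retrieved_ids, ground_truth_docs
            )

        # 2. Content relevance works even without ground truth.
        results["context_relevance"] = self._calculate_context_relevance(
            query, retrieved_docs
        )

        # 3. Overall score.
        if ground_truth_docs:
            results["overall"] = (
                results["recall"] * 0.3 +
                results["precision"] * 0.2 +
                results["mrr"] * 0.2 +
                results["ndcg"] * 0.3
            )
        else:
            results["overall"] = results["context_relevance"]
        return results

    def _calculate_recall(
        self,
        retrieved: List[str],
        ground_truth: List[str],
    ) -> float:
        """Recall: share of ground-truth documents that were retrieved."""
        if not ground_truth:
            return 0.0
        hit = set(retrieved) & set(ground_truth)
        return len(hit) / len(ground_truth)

    def _calculate_precision(
        self,
        retrieved: List[str],
        ground_truth: List[str],
    ) -> float:
        """Precision: share of retrieved documents that are in the ground truth."""
        if not retrieved:
            return 0.0
        hit = set(retrieved) & set(ground_truth)
        return len(hit) / len(retrieved)

    def _calculate_mrr(
        self,
        retrieved: List[str],
        ground_truth: List[str],
    ) -> float:
        """Reciprocal rank of the first relevant document (averaged over queries, this is MRR)."""
        relevant = set(ground_truth)
        for i, doc_id in enumerate(retrieved):
            if doc_id in relevant:
                return 1.0 / (i + 1)
        return 0.0

    def _calculate_ndcg(
        self,
        retrieved: List[str],
        ground_truth: List[str],
    ) -> float:
        """NDCG with binary relevance: gain is 1 for a ground-truth document, else 0 (a simplifying assumption)."""
        relevant = set(ground_truth)
        dcg = sum(
            1.0 / math.log2(rank + 1)
            for rank, doc_id in enumerate(retrieved, start=1)
            if doc_id in relevant
        )
        ideal_hits = min(len(relevant), len(retrieved))
        idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
        return dcg / idcg if idcg > 0 else 0.0

    def _calculate_context_relevance(
        self,
        query: str,
        docs: List[Dict],
    ) -> float:
        """Content relevance (semantic similarity between the query and each document)."""
        if not docs:
            return 0.0
        similarities = []
        query_embedding = get_embedding(query)
        for doc in docs:
            doc_embedding = get_embedding(doc["content"])
            sim = cosine_similarity(query_embedding, doc_embedding)
            similarities.append(sim)
        # Average relevance of the top-k most similar documents.
        k = min(5, len(similarities))
        return float(np.mean(sorted(similarities, reverse=True)[:k]))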
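The context-relevance step depends on `get_embedding` and `cosine_similarity`, which the evaluator does not define. The sketch below shows one way the top-k averaging could be exercised offline. It assumes a toy bag-of-words embedding (`toy_embedding`, a hypothetical stand-in for a real embedding model) and a pure-Python cosine similarity; a production setup would call an actual embedding service instead.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def toy_embedding(text: str, vocab: List[str]) -> List[float]:
    """Word counts over a fixed vocabulary; NOT a real semantic embedding."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocab]


def context_relevance(query: str, docs: List[Dict], vocab: List[str], k: int = 5) -> float:
    """Mean similarity between the query and its k most similar documents."""
    if not docs:
        return 0.0
    query_vec = toy_embedding(query, vocab)
    sims = sorted(
        (cosine_similarity(query_vec, toy_embedding(doc["content"], vocab)) for doc in docs),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top)


vocab = ["return", "policy", "refund", "shipping", "days"]
docs = [
    {"id": "doc_1", "content": "return policy refund within 7 days"},
    {"id": "doc_2", "content": "shipping takes 3 days"},
]
relevance = context_relevance("return policy", docs, vocab)
print(round(relevance, 3))
```

Because the unrelated shipping document scores zero, it pulls the average down. This is why the metric is sensitive to noisy retrieval even when the best document is a good match.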
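Generation Quality Evaluation
RAG-Specific Generation Metrics
| Metric | Description | Evaluation method |
|---|---|---|
| Answer Relevance | Whether the answer addresses the query | G-Eval / semantic similarity |
| Context Utilization | Whether the retrieved content is used correctly | Comparison check |
| Faithfulness | Whether the answer stays faithful to the retrieved content | Factual consistency check |
| Citation Accuracy | Whether citations are attributed correctly | Rules + semantic verification |
| Completeness | Whether the query is answered in full | G-Eval |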
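Faithfulness Evaluation Implementation
# evaluators/faithfulness_evaluator.py
from typing import Dict, List

# call_gpt4() is assumed to be an LLM client helper provided elsewhere in the
# project; it takes a prompt and returns the model's text reply.


class FaithfulnessEvaluator:
    """
    Answer faithfulness evaluator.
    Checks whether the generated content is consistent with the retrieved content.
    """

    def evaluate(
        self,
        answer: str,
        retrieved_docs: List[Dict],
    ) -> Dict:
        """
        Evaluate how faithful the answer is to the retrieved content.

        Returns:
            The faithfulness evaluation result.
        """
        results = {
            "claims": [],
            "supported_claims": 0,
            "unsupported_claims": 0,
            "score": 0.0,
        }

        # 1. Extract factual claims from the answer.
        claims = self._extract_claims(answer)

        # 2. Check whether each claim is supported by the retrieved content.
        for claim in claims:
            support_found = self._check_claim_support(claim, retrieved_docs)
            results["claims"].append({
                "claim": claim,
                "supported": support_found,
            })
            if support_found:
                results["supported_claims"] += 1
            else:
                results["unsupported_claims"] += 1

        # 3. Faithfulness score: share of supported claims (stays 0.0 when no claims are found).
        if claims:
            results["score"] = results["supported_claims"] / len(claims)
        return results

    def _extract_claims(self, answer: str) -> List[str]:
        """
        Use an LLM to extract factual claims from the answer.

        Example output:
            ["The product costs 99 yuan", "Shipping takes 3 days", ...]
        """
        prompt = f"""
Extract all factual claims from the answer below (one per line):
Answer: {answer}
Extract only concrete factual claims; do not include opinions or vague statements.
"""
        response = call_gpt4(prompt, temperature=0)
        return [line.strip() for line in response.split("\n") if line.strip()]

    def _check_claim_support(
        self,
        claim: str,
        docs: List[Dict],
    ) -> bool:
        """Check whether a claim is supported by the documents."""
        # Merge all retrieved content into one context.
        context = "\n".join(doc["content"] for doc in docs)
        # Verify with an LLM.
        prompt = f"""
Decide whether the following claim is supported by the retrieved content:
Claim: {claim}
Retrieved content:
{context}
Answer YES or NO.
"""
        response = call_gpt4(prompt, temperature=0)
        # Require the reply to start with YES; a substring check would also
        # accept replies such as "NO, although yes in some cases".
        return response.strip().upper().startswith("YES")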
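Because `call_gpt4` is an external dependency, it can help to exercise the faithfulness scoring logic offline. The sketch below assumes a hypothetical `judge` callable that stands in for the LLM, plus a verdict parser that only accepts a reply whose first word is YES. None of these names come from the chapter's codebase.

```python
from typing import Callable, Dict, List


def parse_verdict(response: str) -> bool:
    """Treat the reply as 'supported' only if its first word is YES."""
    words = response.strip().upper().split()
    return bool(words) and words[0].strip(".,:;!") == "YES"


def faithfulness_score(
    claims: List[str],
    context: str,
    judge: Callable[[str, str], str],
) -> Dict:
    """Fraction of claims that the judge marks as supported by the context."""
    verdicts = [
        {"claim": claim, "supported": parse_verdict(judge(claim, context))}
        for claim in claims
    ]
    supported = sum(1 for v in verdicts if v["supported"])
    return {
        "claims": verdicts,
        "supported_claims": supported,
        "unsupported_claims": len(claims) - supported,
        "score": supported / len(claims) if claims else 0.0,
    }


def keyword_judge(claim: str, context: str) -> str:
    """Stand-in for the LLM: YES only when the claim appears verbatim in the context."""
    return "YES" if claim.lower() in context.lower() else "NO, not found"


context = "The product costs 99 yuan. Orders ship within 3 days."
claims = [
    "the product costs 99 yuan",
    "orders ship within 3 days",
    "overseas shipping is supported",
]
result = faithfulness_score(claims, context, keyword_judge)
print(result["supported_claims"], result["unsupported_claims"])  # 2 1
```

Injecting the judge as a parameter keeps the scoring deterministic in unit tests, while the real evaluator can pass in a function that wraps the LLM call.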
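Citation Evaluation Implementation
import re
from typing import Dict, List


class CitationEvaluator:
    """Citation accuracy evaluator."""

    def evaluate(
        self,
        answer: str,
        retrieved_docs: List[Dict],
    ) -> Dict:
        """
        Evaluate whether the citations in the answer are correct.

        Returns:
            The citation evaluation result.
        """
        results = {
            "citations_in_answer": [],
            "valid_citations": 0,
            "invalid_citations": 0,
            "missing_citations": False,
            "score": 0.0,
        }

        # 1. Extract the citations that appear in the answer.
        citations = self._extract_citations(answer)
        results["citations_in_answer"] = citations

        # 2. Validate each citation against the retrieved document IDs.
        valid_ids = set(doc["id"] for doc in retrieved_docs)
        for citation in citations:
            if citation in valid_ids:
                results["valid_citations"] += 1
            else:
                results["invalid_citations"] += 1

        # 3. Check for content that should have been cited but was not:
        #    an answer that uses retrieved content without a citation is flagged.
        if self._check_unattributed_content(answer, retrieved_docs):
            results["missing_citations"] = True

        # 4. Compute the score.
        if citations:
            results["score"] = results["valid_citations"] / len(citations)
            if results["missing_citations"]:
                results["score"] *= 0.8  # penalty for missing citations
        return results

    def _extract_citations(self, answer: str) -> List[str]:
        """
        Extract citation markers.

        Supported formats: [doc_1], (source: doc_1), "see document doc_1", and
        similar. The Chinese forms 来源 / 文档 are matched as well.
        """
        patterns = [
            r'\[doc_?\d+\]',
            r'\((?:[Ss]ource|来源)\s*[::]\s*doc_?\d+\)',
            r'(?:[Dd]ocument|文档)\s*doc_?\d+',
        ]
        citations = []
        for pattern in patterns:
            matches = re.findall(pattern, answer)
            # Pull the document ID out of each match.
            for match in matches:
                doc_id = re.search(r'doc_?\d+', match)
                if doc_id:
                    citations.append(doc_id.group())
        return citations

    def _check_unattributed_content(
        self,
        answer: str,
        retrieved_docs: List[Dict],
        min_overlap: float = 0.5,
    ) -> bool:
        """
        Placeholder heuristic (an assumption, to be replaced by a semantic check):
        flag any sentence that carries no citation marker yet shares at least
        `min_overlap` of its words with some retrieved document.
        """
        for sentence in re.split(r'[.!?。!?]\s*', answer):
            words = set(sentence.lower().split())
            if not words or self._extract_citations(sentence):
                continue
            for doc in retrieved_docs:
                doc_words = set(doc["content"].lower().split())
                if len(words & doc_words) / len(words) >= min_overlap:
                    return True
        return False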
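To illustrate how citation extraction feeds the score, here is a reduced, self-contained sketch that handles only the bracket format `[doc_N]`. The names `extract_bracket_citations` and `citation_precision` are illustrative assumptions, and the evaluator above remains the fuller version.

```python
import re
from typing import List

# Capture the document ID inside square brackets, e.g. "[doc_1]" -> "doc_1".
CITATION_PATTERN = re.compile(r"\[(doc_?\d+)\]")


def extract_bracket_citations(answer: str) -> List[str]:
    """Return cited document IDs in the order they appear."""
    return CITATION_PATTERN.findall(answer)


def citation_precision(answer: str, retrieved_ids: List[str]) -> float:
    """Share of citations that point at a document that was actually retrieved."""
    cited = extract_bracket_citations(answer)
    if not cited:
        return 0.0
    valid = set(retrieved_ids)
    return sum(1 for doc_id in cited if doc_id in valid) / len(cited)


answer = "Returns are accepted within 7 days [doc_1]. Refunds take 3 days [doc_9]."
print(extract_bracket_citations(answer))               # ['doc_1', 'doc_9']
print(citation_precision(answer, ["doc_1", "doc_2"]))  # 0.5
```

Here `doc_9` was never retrieved, so it counts as an invalid citation. This is the typical signature of a fabricated source.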
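RAG Harness: Full Implementation
Overall Architecture
# main.py for RAG Harness
from typing import Dict, List

# CompositeEvaluator and AnswerRelevanceEvaluator are assumed to be defined
# elsewhere in the project; CompositeEvaluator.evaluate() is expected to
# return a dict that contains an "overall" key.


class RAGHarness:
    """Evaluation harness for a RAG system."""

    def __init__(self, config: Dict, rag_system):
        self.config = config
        # System under test; it must expose async retrieve() and generate().
        self.rag_system = rag_system
        # Retrieval evaluator
        self.retrieval_eval = RetrievalEvaluator()
        # Generation evaluators
        self.generation_eval = CompositeEvaluator([
            FaithfulnessEvaluator(),
            CitationEvaluator(),
            AnswerRelevanceEvaluator(),
        ])

    async def evaluate(
        self,
        test_cases: List[RAGTestCase],
    ) -> Dict:
        """
        Run the full RAG evaluation.

        Args:
            test_cases: RAG test cases, each with a query and ground truth.

        Returns:
            The combined evaluation report.
        """
        results = []
        for case in test_cases:
            # 1. Run retrieval.
            retrieved_docs = await self.rag_system.retrieve(case.query)

            # 2. Run generation.
            answer = await self.rag_system.generate(
                case.query, retrieved_docs
            )

            # 3. Evaluate retrieval.
            retrieval_scores = self.retrieval_eval.evaluate(
                case.query,
                retrieved_docs,
                case.ground_truth_docs,
            )

            # 4. Evaluate generation.
            generation_scores = self.generation_eval.evaluate(
                answer,
                case.query,
                retrieved_docs,
            )

            # 5. Combined score: retrieval 40%, generation 60%.
            overall = (
                retrieval_scores["overall"] * 0.4 +
                generation_scores["overall"] * 0.6
            )
            results.append({
                "case_id": case.id,
                "query": case.query,
                "retrieval": retrieval_scores,
                "generation": generation_scores,
                "overall": overall,
            })

        # Aggregate the per-case results into a report.
        return self._aggregate(results)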
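The harness ends by calling `_aggregate`, whose body is not shown. One possible shape for that step is sketched below as a standalone function. The name `aggregate_results`, the `pass_threshold` of 0.7, and the output keys are assumptions chosen for illustration.

```python
from statistics import mean
from typing import Dict, List


def aggregate_results(results: List[Dict], pass_threshold: float = 0.7) -> Dict:
    """Average the per-case scores and list the cases below the pass threshold."""
    if not results:
        return {
            "summary": {
                "overall_score": 0.0,
                "retrieval_score": 0.0,
                "generation_score": 0.0,
                "pass_rate": 0.0,
            },
            "failed_cases": [],
        }
    failed = [r["case_id"] for r in results if r["overall"] < pass_threshold]
    return {
        "summary": {
            "overall_score": mean(r["overall"] for r in results),
            "retrieval_score": mean(r["retrieval"]["overall"] for r in results),
            "generation_score": mean(r["generation"]["overall"] for r in results),
            "pass_rate": 1 - len(failed) / len(results),
        },
        "failed_cases": failed,
    }


per_case = [
    {"case_id": "rag_001", "retrieval": {"overall": 0.8},
     "generation": {"overall": 0.9}, "overall": 0.86},
    {"case_id": "rag_002", "retrieval": {"overall": 0.5},
     "generation": {"overall": 0.6}, "overall": 0.56},
]
report = aggregate_results(per_case)
print(report["summary"]["pass_rate"], report["failed_cases"])  # 0.5 ['rag_002']
```

Returning the failed case IDs alongside the averages makes it straightforward to drill into individual regressions instead of reading only the headline score.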
RAG Test Case Structure
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RAGTestCase:
    """Data structure for a RAG test case."""
    id: str
    query: str  # user query
    # Retrieval ground truth (optional)
    ground_truth_docs: Optional[List[str]] = None  # IDs of documents that should be retrieved
    # Generation ground truth (optional)
    reference_answer: Optional[str] = None  # reference answer
    # Metadata
    category: str = "general"
    difficulty: str = "normal"


# Example:
# {
#     "id": "rag_001",
#     "query": "What is the product return process?",
#     "ground_truth_docs": ["doc_return_policy", "doc_faq_001"],
#     "reference_answer": "Return process: 1. Submit a request...",
#     "category": "policy",
#     "difficulty": "normal"
# }
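Test cases in this shape are easy to keep in a JSON file. The sketch below shows one way to load them; the `load_test_cases` helper is an assumption for illustration, and the dataclass is repeated only so the snippet runs on its own.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RAGTestCase:
    id: str
    query: str
    ground_truth_docs: Optional[List[str]] = None
    reference_answer: Optional[str] = None
    category: str = "general"
    difficulty: str = "normal"


def load_test_cases(raw_json: str) -> List[RAGTestCase]:
    """Parse a JSON array of objects into test cases; unknown keys raise TypeError."""
    return [RAGTestCase(**item) for item in json.loads(raw_json)]


raw = """
[
  {"id": "rag_001", "query": "What is the product return process?",
   "ground_truth_docs": ["doc_return_policy", "doc_faq_001"], "category": "policy"},
  {"id": "rag_002", "query": "How long does shipping take?"}
]
"""
cases = load_test_cases(raw)
print(cases[0].category, cases[1].ground_truth_docs)  # policy None
```

A case without `ground_truth_docs`, like the second one here, is still usable: the retrieval evaluator then falls back to context relevance alone.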
Configuration Example
# config/rag_eval_config.yaml
evaluation:
  name: "rag_system_eval"
  retrieval:
    metrics:
      - name: recall
        weight: 0.3
        threshold: 0.8
      - name: precision
        weight: 0.2
        threshold: 0.7
      - name: context_relevance
        weight: 0.5
        threshold: 0.75
  generation:
    metrics:
      - name: faithfulness
        weight: 0.35
        threshold: 0.85
      - name: answer_relevance
        weight: 0.35
        threshold: 0.75
      - name: citation_accuracy
        weight: 0.15
        threshold: 0.9
      - name: completeness
        weight: 0.15
        threshold: 0.7
  overall_weights:
    retrieval: 0.4
    generation: 0.6
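A configuration like this carries two pieces of information per metric: a weight for the stage score and a threshold for flagging problems. The sketch below shows how both might be applied. It uses a plain Python list that mirrors the retrieval metrics (in practice the values would come from a YAML loader), and the `score_stage` helper and the sample metric values are assumptions.

```python
from typing import Dict, List

# Mirrors the retrieval metrics from the configuration.
RETRIEVAL_METRICS = [
    {"name": "recall", "weight": 0.3, "threshold": 0.8},
    {"name": "precision", "weight": 0.2, "threshold": 0.7},
    {"name": "context_relevance", "weight": 0.5, "threshold": 0.75},
]


def score_stage(metrics: List[Dict], values: Dict[str, float]) -> Dict:
    """Weighted stage score plus the names of metrics below their threshold."""
    total_weight = sum(m["weight"] for m in metrics)
    if abs(total_weight - 1.0) > 1e-6:
        raise ValueError(f"metric weights sum to {total_weight}, expected 1.0")
    score = sum(m["weight"] * values[m["name"]] for m in metrics)
    below = [m["name"] for m in metrics if values[m["name"]] < m["threshold"]]
    return {"score": score, "below_threshold": below}


# Hypothetical measured values for one evaluation run.
stage = score_stage(
    RETRIEVAL_METRICS,
    {"recall": 0.85, "precision": 0.65, "context_relevance": 0.80},
)
print(round(stage["score"], 3), stage["below_threshold"])  # 0.785 ['precision']
```

Validating that the weights sum to 1 at load time catches a common configuration mistake before it silently skews every reported score.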
Evaluation Report Example
{
  "summary": {
    "overall_score": 0.78,
    "retrieval_score": 0.82,
    "generation_score": 0.75,
    "pass_rate": 0.78
  },
  "retrieval_analysis": {
    "avg_recall": 0.85,
    "avg_precision": 0.72,
    "avg_context_relevance": 0.80,
    "low_recall_cases": ["rag_015", "rag_032"]
  },
  "generation_analysis": {
    "avg_faithfulness": 0.88,
    "avg_answer_relevance": 0.76,
    "avg_citation_accuracy": 0.65,
    "faithfulness_issues": [
      {
        "case_id": "rag_045",
        "unsupported_claims": ["The product supports overseas shipping"],
        "retrieved_docs": ["doc_shipping_policy"]
      }
    ],
    "citation_issues": [
      {
        "case_id": "rag_023",
        "issue": "missing_citation",
        "unattributed_content": "Per policy, returns must be made within 7 days"
      }
    ]
  },
  "recommendations": [
    "Retrieval recall is low on some cases; consider tuning the retrieval strategy or widening index coverage",
    "Citation accuracy is a prominent problem; consider strengthening the citation instructions in the prompt",
    "Case rag_045 contains hallucinated content; consider tightening the faithfulness constraints"
  ]
}
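Summary
Key points of a RAG evaluation harness:
| Stage | Core metrics | Implementation notes |
|---|---|---|
| Retrieval | Recall, Precision, Context Relevance | Recall and Precision require a ground-truth document list |
| Generation | Faithfulness, Citation, Answer Relevance | Check consistency with the retrieved content |
| Overall | Weighted combination | Weights: retrieval 40%, generation 60% |
The next chapter discusses closed-loop design for monitoring and continuous optimization.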