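The Complete Guide to LLM API Performance Testing
Systematic performance testing helps you understand your API's performance limits, optimize your system architecture, and keep the service stable under high load.
Performance Testing Dimensions
⏱️ Latency Testing
- Time to first token (TTFT)
- End-to-end latency
- P50/P95/P99 percentiles
- Streaming output latency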
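The latency metrics listed above can be derived from a request's start time and the arrival times of its streamed chunks. The following is a minimal sketch, not a definitive implementation: the function names `latency_metrics` and `percentile` are hypothetical, timestamps are assumed to be in seconds, and streaming latency is approximated as the mean gap between consecutive chunks.

```python
import math


def latency_metrics(start, chunk_times):
    """Derive per-request latency metrics from a start time and the
    arrival times of streamed chunks (all in seconds)."""
    if not chunk_times:
        raise ValueError("no chunks received")
    # Gaps between consecutive chunks approximate streaming output latency.
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    return {
        "ttft": chunk_times[0] - start,
        "end_to_end": chunk_times[-1] - start,
        "mean_inter_chunk": sum(gaps) / len(gaps) if gaps else 0.0,
    }


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]


# One request whose chunks arrived 0.5 s, 0.75 s, and 1.0 s after it started.
print(latency_metrics(0.0, [0.5, 0.75, 1.0]))
# → {'ttft': 0.5, 'end_to_end': 1.0, 'mean_inter_chunk': 0.25}

# End-to-end latencies of five requests; a single slow request dominates P95.
latencies = [0.8, 1.1, 0.9, 3.0, 1.0]
print(percentile(latencies, 50), percentile(latencies, 95))
# → 1.0 3.0
```

With small samples, the high percentiles are decided by one or two slow requests, so P95 and P99 are only meaningful once a test has collected at least a few hundred requests.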
📊 Throughput Testing
- QPS (queries per second)
- TPS (tokens per second)
- Number of concurrent users
- Resource utilization
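Aggregate throughput is measured over the wall-clock duration of the whole test, which is not the same as averaging per-request rates. The sketch below assumes each result is a dict with a `tokens` count, or an `error` key for failed requests; the function name `throughput` and the `error_rate` field are illustrative additions.

```python
def throughput(results, wall_seconds):
    """Aggregate throughput over one test window.

    results: list of per-request dicts; successful ones carry a "tokens"
             count, failed ones carry an "error" key.
    wall_seconds: wall-clock duration of the whole test.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    ok = [r for r in results if "error" not in r]
    return {
        # Completed requests per second across all concurrent clients.
        "qps": len(ok) / wall_seconds,
        # Tokens generated per second across all concurrent clients.
        "tps": sum(r["tokens"] for r in ok) / wall_seconds,
        "error_rate": (len(results) - len(ok)) / len(results) if results else 0.0,
    }


results = [{"tokens": 100}, {"tokens": 300}, {"error": "timeout"}, {"tokens": 200}]
print(throughput(results, wall_seconds=2.0))
# → {'qps': 1.5, 'tps': 300.0, 'error_rate': 0.25}
```

Reporting the error rate alongside QPS matters because a system that rejects requests quickly can otherwise appear to sustain a high request rate.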
Test Tool Implementation
import asyncio
import math
import statistics
import time

import aiohttp


class LLMPerformanceTester:
    """Performance testing tool for an LLM API."""

    def __init__(self, api_key, base_url):
        self.api_key = api_key
        self.base_url = base_url
        self.results = []

    async def single_request(self, prompt, session):
        """Send one streaming request and record its timings."""
        start_time = time.perf_counter()
        first_token_time = None
        chunks = 0
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        data = {
            "model": "gpt-4",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        }
        try:
            async with session.post(
                f"{self.base_url}/chat/completions",
                headers=headers,
                json=data,
            ) as response:
                response.raise_for_status()
                async for line in response.content:
                    # Skip blank separator lines so they are not counted as output.
                    if not line.strip():
                        continue
                    if first_token_time is None:
                        first_token_time = time.perf_counter()
                    # Each non-empty streamed line is counted as one token,
                    # which only approximates the true token count.
                    chunks += 1
            end_time = time.perf_counter()
            if first_token_time is None:
                return {"error": "response contained no streamed data"}
            total_time = end_time - start_time
            return {
                "total_time": total_time,
                "ttft": first_token_time - start_time,
                "tokens": chunks,
                "tps": chunks / total_time,
            }
        except Exception as e:
            return {"error": str(e)}

    async def concurrent_test(self, prompt, num_requests=100):
        """Concurrency test: send num_requests requests at the same time."""
        async with aiohttp.ClientSession() as session:
            tasks = [self.single_request(prompt, session) for _ in range(num_requests)]
            results = await asyncio.gather(*tasks)
        return self.analyze_results(results)

    def analyze_results(self, results):
        """Summarize a list of per-request results."""
        self.results = list(results)
        successful = [r for r in results if "error" not in r]
        summary = {
            "total_requests": len(results),
            "successful": len(successful),
            "failed": len(results) - len(successful),
        }
        # Latency statistics are only defined when at least one request succeeded.
        if successful:
            latencies = [r["total_time"] for r in successful]
            ttfts = [r["ttft"] for r in successful]
            tps_values = [r["tps"] for r in successful]
            summary.update({
                "avg_latency": statistics.mean(latencies),
                "p50_latency": statistics.median(latencies),
                "p95_latency": self.percentile(latencies, 95),
                "p99_latency": self.percentile(latencies, 99),
                "avg_ttft": statistics.mean(ttfts),
                "avg_tps": statistics.mean(tps_values),
            })
        return summary

    def percentile(self, data, p):
        """Compute the p-th percentile using the nearest-rank method."""
        sorted_data = sorted(data)
        rank = max(1, math.ceil(len(sorted_data) * p / 100))
        return sorted_data[rank - 1]

    def load_test(self, prompt, duration=60, rps=10):
        """Load test: send requests at a fixed rate for `duration` seconds."""

        async def run():
            async with aiohttp.ClientSession() as session:
                tasks = []
                start_time = time.perf_counter()
                while time.perf_counter() - start_time < duration:
                    # Start each request without awaiting it, so a slow
                    # response does not lower the send rate.
                    tasks.append(asyncio.create_task(self.single_request(prompt, session)))
                    await asyncio.sleep(1 / rps)
                return await asyncio.gather(*tasks)

        results = asyncio.run(run())
        return self.analyze_results(results)

Performance Benchmarks
Performance targets by scenario
| Scenario | TTFT | TPS (tokens/s) | P95 latency |
|---|---|---|---|
| Simple Q&A | < 500 ms | > 80 | < 2 s |
| Code generation | < 800 ms | > 60 | < 5 s |
| Long-form text generation | < 1 s | > 40 | < 10 s |
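Targets like those in the table can be checked automatically against a test summary. The sketch below encodes the table's thresholds in seconds and tokens per second. The scenario keys, the `check_targets` function, and the decision to compare the TTFT and TPS targets against the `avg_ttft` and `avg_tps` fields are assumptions, since the table does not say which statistic each target refers to.

```python
# Thresholds copied from the table; the scenario keys are illustrative names.
TARGETS = {
    "simple_qa": {"ttft": 0.5, "tps": 80, "p95_latency": 2.0},
    "code_generation": {"ttft": 0.8, "tps": 60, "p95_latency": 5.0},
    "long_form": {"ttft": 1.0, "tps": 40, "p95_latency": 10.0},
}


def check_targets(scenario, summary):
    """Return the names of the targets that a test summary misses.

    summary is expected to contain "avg_ttft", "avg_tps", and "p95_latency".
    """
    target = TARGETS[scenario]
    failures = []
    if summary["avg_ttft"] >= target["ttft"]:
        failures.append("ttft")
    if summary["avg_tps"] <= target["tps"]:
        failures.append("tps")
    if summary["p95_latency"] >= target["p95_latency"]:
        failures.append("p95_latency")
    return failures


summary = {"avg_ttft": 0.42, "avg_tps": 72.0, "p95_latency": 1.8}
print(check_targets("simple_qa", summary))
# → ['tps']
```

An empty list means every target for the scenario was met, which makes the function easy to use as a pass/fail gate in a CI job.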
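Performance Optimization Recommendations
Client-Side Optimization
- ✅ Use connection pooling
- ✅ Implement request retries
- ✅ Batch requests
- ✅ Cache results locally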
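Request retries are commonly implemented with exponential backoff and jitter, so that many clients do not retry in lockstep after a shared failure. The helper below is a minimal sketch of that pattern: `retry_with_backoff` and its parameters are hypothetical names, and for brevity it retries on any exception, whereas production code should retry only on transient errors such as timeouts or HTTP 429 and 5xx responses.

```python
import random
import time


def retry_with_backoff(call, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Invoke call() until it succeeds or max_attempts is reached.

    The wait doubles after each failure (capped at max_delay) and is scaled
    by a random factor in [0.5, 1.0] to spread out simultaneous retries.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter exists so that tests can substitute a recording function in place of a real delay. In a load test, retried requests should be counted separately, because retries hide failures and lengthen the measured latency.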
Server-Side Optimization
- ✅ Load balancing
- ✅ Autoscaling
- ✅ Edge caching
- ✅ Traffic control (rate limiting)
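One common way to implement traffic control is a token bucket, which allows short bursts while capping the sustained request rate. The class below is an illustrative sketch rather than a production limiter: `TokenBucket` and its parameters are hypothetical names, it is not thread-safe, and the injectable `clock` is there only to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: refills at `rate` tokens per second,
    up to `capacity` tokens."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self.clock()
        # Refill in proportion to the time elapsed since the last call.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Allow 2 requests per second with a burst of up to 2.
bucket = TokenBucket(rate=2, capacity=2)
print([bucket.allow() for _ in range(3)])
```

For an LLM API, `cost` could be set to a request's estimated token count instead of 1, so that the limit tracks generation load rather than request count.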