The Complete Guide to LLM API Performance Testing

Systematic performance testing helps you understand your API's performance limits, optimize your system architecture, and ensure stable operation under heavy load.

Performance Testing Dimensions

⏱️ Latency Testing

  • Time to first token (TTFT)
  • End-to-end latency
  • P50/P95/P99 percentiles (see the sketch below)
  • Streaming output latency
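
The percentile metrics above can be computed directly from a list of measured latencies; here is a minimal sketch using only the standard library (the sample values are made up purely for illustration):

import statistics

# Hypothetical end-to-end latencies in seconds, for illustration only
latencies = [0.82, 0.91, 1.05, 1.10, 1.32, 1.48, 1.75, 2.10, 2.45, 3.20]

p50 = statistics.median(latencies)
# statistics.quantiles returns 99 cut points when n=100 (Python 3.8+)
cuts = statistics.quantiles(latencies, n=100)
p95, p99 = cuts[94], cuts[98]
print(f"P50={p50:.2f}s  P95={p95:.2f}s  P99={p99:.2f}s")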

📊 Throughput Testing

  • QPS (queries per second; see the sketch below)
  • TPS (tokens per second)
  • Number of concurrent users
  • Resource utilization
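
Throughput figures are simple ratios over the measurement window; a rough sketch (the counter values are made up for illustration):

# Hypothetical counters collected over a 60-second test window
completed_requests = 540
generated_tokens = 43_200
window_seconds = 60

qps = completed_requests / window_seconds   # queries per second
tps = generated_tokens / window_seconds     # tokens per second
print(f"QPS={qps:.1f}  TPS={tps:.0f}")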

Test Tool Implementation

import asyncio
import json
import statistics
import time

import aiohttp

class LLMPerformanceTester:
    """LLM API性能测试工具"""
    
    def __init__(self, api_key, base_url):
        self.api_key = api_key
        self.base_url = base_url
        self.results = []
    
    async def single_request(self, prompt, session):
        """Time a single streaming request."""
        start_time = time.time()
        first_token_time = None
        chunks = []
        
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
        
        data = {
            "model": "gpt-4",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True
        }
        
        try:
            async with session.post(
                f"{self.base_url}/chat/completions",
                headers=headers,
                json=data
            ) as response:
                response.raise_for_status()
                
                async for raw_line in response.content:
                    # Server-sent events arrive as lines of the form "data: {...}"
                    line = raw_line.decode("utf-8").strip()
                    if not line.startswith("data:"):
                        continue
                    payload = line[len("data:"):].strip()
                    if payload == "[DONE]":
                        break
                    
                    delta = json.loads(payload)["choices"][0].get("delta", {})
                    content = delta.get("content")
                    if content:
                        if first_token_time is None:
                            first_token_time = time.time()
                        chunks.append(content)
                
                end_time = time.time()
                
                if first_token_time is None:
                    return {"error": "no content received"}
                
                return {
                    "total_time": end_time - start_time,
                    "ttft": first_token_time - start_time,
                    # Streamed content chunks approximate the token count
                    "tokens": len(chunks),
                    "tps": len(chunks) / (end_time - start_time)
                }
        except Exception as e:
            return {"error": str(e)}
    
    async def concurrent_test(self, prompt, num_requests=100):
        """Issue num_requests identical requests concurrently."""
        async with aiohttp.ClientSession() as session:
            tasks = []
            for _ in range(num_requests):
                task = self.single_request(prompt, session)
                tasks.append(task)
            
            results = await asyncio.gather(*tasks)
        
        # Analyze the results
        successful = [r for r in results if "error" not in r]
        failed = len(results) - len(successful)
        
        if not successful:
            return {"total_requests": num_requests, "successful": 0, "failed": failed}
        
        latencies = [r["total_time"] for r in successful]
        ttfts = [r["ttft"] for r in successful]
        tps_values = [r["tps"] for r in successful]
        
        return {
            "total_requests": num_requests,
            "successful": len(successful),
            "failed": failed,
            "avg_latency": statistics.mean(latencies),
            "p50_latency": statistics.median(latencies),
            "p95_latency": self.percentile(latencies, 95),
            "p99_latency": self.percentile(latencies, 99),
            "avg_ttft": statistics.mean(ttfts),
            "avg_tps": statistics.mean(tps_values)
        }
    
    def percentile(self, data, p):
        """Nearest-rank percentile (index clamped to the valid range)."""
        sorted_data = sorted(data)
        index = min(len(sorted_data) - 1, int(len(sorted_data) * p / 100))
        return sorted_data[index]
    
    async def load_test(self, prompt, duration=60, rps=10):
        """Sustain a fixed request rate (rps) for `duration` seconds."""
        tasks = []
        async with aiohttp.ClientSession() as session:
            start_time = time.time()
            
            while time.time() - start_time < duration:
                # Fire requests at the target rate without waiting for responses
                tasks.append(asyncio.create_task(self.single_request(prompt, session)))
                await asyncio.sleep(1 / rps)
            
            results = await asyncio.gather(*tasks)
        
        # Raw per-request results; aggregate them the same way concurrent_test does
        return results
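
A usage sketch for the tester above, assuming an OpenAI-compatible endpoint; the API key, base URL, and prompt are placeholders:

async def main():
    tester = LLMPerformanceTester(
        api_key="YOUR_API_KEY",                  # placeholder
        base_url="https://api.example.com/v1"    # placeholder OpenAI-compatible endpoint
    )
    report = await tester.concurrent_test("Explain what an LLM is.", num_requests=20)
    print(report)

asyncio.run(main())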

Performance Benchmarks

Performance targets for different scenarios

Test scenario          TTFT      TPS    P95 latency
Simple Q&A             < 500ms   > 80   < 2s
Code generation        < 800ms   > 60   < 5s
Long-form generation   < 1s      > 40   < 10s
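
One way to turn the table above into an automated check is to compare a concurrent_test report against per-scenario thresholds; a minimal sketch (the threshold values mirror the "Simple Q&A" row, and the function and constant names are illustrative):

# Thresholds for the "Simple Q&A" scenario: seconds for latency, tokens/s for TPS
SIMPLE_QA_THRESHOLDS = {"ttft": 0.5, "tps": 80, "p95_latency": 2.0}

def check_benchmark(report, thresholds=SIMPLE_QA_THRESHOLDS):
    """Compare a concurrent_test report against scenario thresholds."""
    return {
        "ttft_ok": report["avg_ttft"] < thresholds["ttft"],
        "tps_ok": report["avg_tps"] > thresholds["tps"],
        "p95_ok": report["p95_latency"] < thresholds["p95_latency"],
    }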

Performance Optimization Tips

Client-Side Optimizations

  • ✅ Use connection pooling (see the sketch below)
  • ✅ Implement request retries
  • ✅ Batch requests where possible
  • ✅ Cache results locally
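
A minimal sketch of the first two items, connection pooling and retries, using aiohttp; the pool size, retry count, and backoff schedule are illustrative choices, not recommendations:

import asyncio
import aiohttp

async def post_with_retries(session, url, payload, headers, max_retries=3):
    """POST with exponential backoff on client errors."""
    for attempt in range(max_retries):
        try:
            async with session.post(url, json=payload, headers=headers) as resp:
                resp.raise_for_status()
                return await resp.json()
        except aiohttp.ClientError:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, ... backoff

async def main():
    # Reuse a single session with a bounded connection pool for all requests
    connector = aiohttp.TCPConnector(limit=50)
    async with aiohttp.ClientSession(connector=connector) as session:
        ...  # issue requests through post_with_retries(session, ...)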

Server-Side Optimizations

  • ✅ Load balancing
  • ✅ Autoscaling
  • ✅ Edge caching
  • ✅ Traffic control / rate limiting (see the sketch below)
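
As one concrete example of traffic control, a minimal token-bucket rate limiter sketch (the rate and burst capacity are illustrative):

import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""
    
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()
    
    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(rate=10, capacity=20)   # illustrative: 10 req/s, bursts of 20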

Optimize Your API Performance

Through systematic performance testing, you can ensure your AI application runs reliably and efficiently under any load.
