mistral-performance-tuning
Optimize Mistral AI performance with caching, batching, and latency reduction. Use when experiencing slow API responses, implementing caching strategies, or optimizing request throughput for Mistral AI integrations. Trigger with phrases like "mistral performance", "optimize mistral", "mistral latency", "mistral caching", "mistral slow", "mistral batch".

allowed-tools: Read, Write, Edit
version: 1.0.0
license: MIT
author: Jeremy Longshore <jeremy@intentsolutions.io>
Allowed Tools
Read, Write, Edit
Provided by Plugin
mistral-pack
Claude Code skill pack for Mistral AI (24 skills)
Installation
This skill is included in the mistral-pack plugin:
/plugin install mistral-pack@claude-code-plugins-plus
Instructions
# Mistral AI Performance Tuning
## Overview
Optimize Mistral AI API performance with caching, batching, and latency reduction techniques.
## Prerequisites
- Mistral AI SDK installed
- Understanding of async patterns
- Redis or in-memory cache available (optional)
- Performance monitoring in place
## Latency Benchmarks
| Model | P50 | P95 | P99 | Use Case |
|-------|-----|-----|-----|----------|
| mistral-small-latest | 200ms | 500ms | 1s | Fast responses |
| mistral-large-latest | 500ms | 1.5s | 3s | Complex reasoning |
| mistral-embed | 50ms | 150ms | 300ms | Embeddings |
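These numbers are indicative and vary with region, prompt size, and load, so measure your own percentiles before tuning. The sketch below is one way to do that, assuming `client` is an initialized `Mistral` client; the sample count and prompt are arbitrary.

```typescript
// Rough latency benchmark: send N small identical requests and report P50/P95/P99
async function benchmarkLatency(client: Mistral, model: string, samples = 50): Promise<void> {
  const latencies: number[] = [];
  for (let i = 0; i < samples; i++) {
    const start = Date.now();
    await client.chat.complete({
      model,
      messages: [{ role: 'user', content: 'Reply with the single word: ok' }],
    });
    latencies.push(Date.now() - start);
  }
  latencies.sort((a, b) => a - b);
  // Nearest-rank percentile over the sorted samples
  const pct = (p: number) =>
    latencies[Math.min(latencies.length - 1, Math.ceil((p / 100) * latencies.length) - 1)];
  console.log(`P50=${pct(50)}ms P95=${pct(95)}ms P99=${pct(99)}ms`);
}
```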
## Instructions
### Step 1: Response Caching
```typescript
import { LRUCache } from 'lru-cache';
import crypto from 'crypto';
import { Mistral } from '@mistralai/mistralai';

const cache = new LRUCache<string, string>({
  max: 1000,
  ttl: 5 * 60 * 1000, // 5 minutes
  updateAgeOnGet: true,
});

function getCacheKey(messages: any[], model: string, options?: any): string {
  const data = JSON.stringify({ messages, model, options });
  return crypto.createHash('sha256').update(data).digest('hex');
}

async function cachedChat(
  client: Mistral,
  messages: any[],
  model: string,
  options?: { temperature?: number; maxTokens?: number }
): Promise<string> {
  // Only cache deterministic requests (temperature = 0)
  const isCacheable = (options?.temperature ?? 0.7) === 0;

  if (isCacheable) {
    const key = getCacheKey(messages, model, options);
    const cached = cache.get(key);
    if (cached) {
      console.log('Cache hit');
      return cached;
    }
  }

  const response = await client.chat.complete({
    model,
    messages,
    ...options,
  });
  const content = (response.choices?.[0]?.message?.content ?? '') as string;

  if (isCacheable) {
    const key = getCacheKey(messages, model, options);
    cache.set(key, content);
  }
  return content;
}
```
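A brief usage sketch; only deterministic calls (temperature 0) hit the cache:

```typescript
const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });

// The second identical call is served from the LRU cache
const answer = await cachedChat(
  client,
  [{ role: 'user', content: 'What is the capital of France?' }],
  'mistral-small-latest',
  { temperature: 0 }
);
```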
### Step 2: Redis Distributed Caching
```typescript
import Redis from 'ioredis';
import crypto from 'crypto';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

async function cachedWithRedis<T>(
  key: string,
  fetcher: () => Promise<T>,
  ttlSeconds = 300
): Promise<T> {
  const cached = await redis.get(key);
  if (cached) {
    return JSON.parse(cached);
  }
  const result = await fetcher();
  await redis.setex(key, ttlSeconds, JSON.stringify(result));
  return result;
}

// Simple cosine similarity between two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Semantic cache for similar queries
async function semanticCache(
  client: Mistral,
  query: string,
  threshold = 0.95
): Promise<string | null> {
  // Get embedding for query
  const queryEmbed = await client.embeddings.create({
    model: 'mistral-embed',
    inputs: [query],
  });
  const queryVector = queryEmbed.data[0].embedding;

  // Check cache for similar queries
  const cachedQueries = await redis.keys('semantic:*');
  for (const key of cachedQueries) {
    const cached = JSON.parse((await redis.get(key)) || '{}');
    const similarity = cosineSimilarity(queryVector, cached.embedding);
    if (similarity >= threshold) {
      console.log(`Semantic cache hit (similarity: ${similarity.toFixed(3)})`);
      return cached.response;
    }
  }
  return null;
}
```
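The semantic lookup above only reads from the cache; entries have to be written somewhere with the same shape (an `embedding` plus a `response` under a `semantic:*` key). A minimal write-path sketch under those assumptions:

```typescript
// Store a query's embedding and response so semanticCache can match similar queries later
async function storeSemanticEntry(
  client: Mistral,
  query: string,
  response: string,
  ttlSeconds = 3600
): Promise<void> {
  const embed = await client.embeddings.create({
    model: 'mistral-embed',
    inputs: [query],
  });
  const entry = {
    embedding: embed.data[0].embedding,
    response,
  };
  // Key format mirrors the `semantic:*` pattern scanned by semanticCache
  const keyHash = crypto.createHash('sha256').update(query).digest('hex');
  await redis.setex(`semantic:${keyHash}`, ttlSeconds, JSON.stringify(entry));
}
```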
### Step 3: Request Batching
```typescript
import DataLoader from 'dataloader';

// Batch embedding requests (`client` is the initialized Mistral client)
const embeddingLoader = new DataLoader(
  async (texts) => {
    const response = await client.embeddings.create({
      model: 'mistral-embed',
      inputs: texts as string[],
    });
    return response.data.map(d => d.embedding);
  },
  {
    maxBatchSize: 100, // Mistral limit
    batchScheduleFn: callback => setTimeout(callback, 10), // 10ms window
  }
);

// Usage - automatically batched
const [embed1, embed2, embed3] = await Promise.all([
  embeddingLoader.load('Text 1'),
  embeddingLoader.load('Text 2'),
  embeddingLoader.load('Text 3'),
]);
```
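The 10ms scheduling window trades a small amount of added latency for fewer, larger embedding calls: loads issued within the same window are coalesced into a single API request (up to `maxBatchSize`). Widen the window for offline or batch pipelines; shrink it or omit `batchScheduleFn` for latency-sensitive interactive paths.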
### Step 4: Connection Optimization
```typescript
import { Agent } from 'https';
import { Mistral } from '@mistralai/mistralai';

// Keep-alive connection pooling
const agent = new Agent({
  keepAlive: true,
  maxSockets: 10,
  maxFreeSockets: 5,
  timeout: 60000,
});

// Note: Check if Mistral client supports custom agents
// If not, connection pooling happens at the HTTP level
```
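If the SDK in your project uses Node's built-in `fetch` (Node 18+), connection reuse can also be tuned process-wide through undici's global dispatcher. This is a general Node-level sketch, not a documented Mistral SDK option, so verify it applies to your SDK version:

```typescript
import { Agent, setGlobalDispatcher } from 'undici';

// Applies to every global fetch() call in this process
setGlobalDispatcher(
  new Agent({
    connections: 10,          // cap concurrent sockets per origin
    keepAliveTimeout: 60_000, // keep idle sockets open for 60s before closing
  })
);
```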
### Step 5: Streaming for Perceived Performance
```typescript
// Streaming reduces Time to First Token (TTFT)
async function* streamWithMetrics(
  client: Mistral,
  messages: any[],
  model: string
): AsyncGenerator<{ content: string; metrics: any }> {
  const startTime = Date.now();
  let firstTokenTime: number | null = null;
  let tokenCount = 0;

  const stream = await client.chat.stream({ model, messages });
  for await (const event of stream) {
    const content = event.data?.choices?.[0]?.delta?.content;
    if (content) {
      if (!firstTokenTime) {
        firstTokenTime = Date.now();
      }
      tokenCount++;
      yield {
        content,
        metrics: {
          ttft: firstTokenTime - startTime,
          tokensPerSecond: tokenCount / ((Date.now() - startTime) / 1000),
        },
      };
    }
  }
}

// Usage
let fullResponse = '';
let lastMetrics: { ttft: number; tokensPerSecond: number } | undefined;
for await (const { content, metrics } of streamWithMetrics(client, messages, 'mistral-small-latest')) {
  fullResponse += content;
  lastMetrics = metrics;
  process.stdout.write(content);
}
if (lastMetrics) {
  console.log(`\nTTFT: ${lastMetrics.ttft}ms, Speed: ${lastMetrics.tokensPerSecond.toFixed(1)} tok/s`);
}
```
### Step 6: Model Selection for Speed
```typescript
type SpeedTier = 'fastest' | 'balanced' | 'quality';

function selectModelForSpeed(tier: SpeedTier, taskComplexity: 'low' | 'medium' | 'high'): string {
  const matrix = {
    fastest: {
      low: 'mistral-small-latest',
      medium: 'mistral-small-latest',
      high: 'mistral-small-latest',
    },
    balanced: {
      low: 'mistral-small-latest',
      medium: 'mistral-small-latest',
      high: 'mistral-large-latest',
    },
    quality: {
      low: 'mistral-small-latest',
      medium: 'mistral-large-latest',
      high: 'mistral-large-latest',
    },
  };
  return matrix[tier][taskComplexity];
}
```
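A short usage sketch wiring the selector into a chat call:

```typescript
// Latency-sensitive, low-complexity task: prefer the fastest tier
const model = selectModelForSpeed('fastest', 'low');

const response = await client.chat.complete({
  model,
  messages: [{ role: 'user', content: 'Summarize this ticket in one sentence.' }],
});
```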
### Step 7: Performance Monitoring
```typescript
interface PerformanceMetrics {
  model: string;
  latencyMs: number;
  ttftMs?: number;
  tokensPerSecond?: number;
  inputTokens: number;
  outputTokens: number;
  cached: boolean;
}

async function measurePerformance(
  operation: () => Promise<any>,
  metadata: Partial<PerformanceMetrics>
): Promise<{ result: any; metrics: PerformanceMetrics }> {
  const start = Date.now();
  const result = await operation();
  const metrics: PerformanceMetrics = {
    model: metadata.model || 'unknown',
    latencyMs: Date.now() - start,
    inputTokens: result.usage?.promptTokens || 0,
    outputTokens: result.usage?.completionTokens || 0,
    cached: metadata.cached || false,
    ...metadata,
  };
  // Log to monitoring system
  console.log('[PERF]', JSON.stringify(metrics));
  return { result, metrics };
}

// Usage
const { result, metrics } = await measurePerformance(
  () => client.chat.complete({ model, messages }),
  { model, cached: false }
);
```
## Output
- Reduced API latency
- Caching layer implemented
- Request batching enabled
- Performance monitoring active
## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Cache miss storm | TTL expired | Use stale-while-revalidate (see sketch below) |
| Batch timeout | Too many items | Reduce batch size |
| Memory pressure | Cache too large | Set max cache entries |
| Slow TTFT | Large prompts | Reduce prompt size or use smaller model |
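For the cache-miss-storm case, here is a stale-while-revalidate sketch building on the Redis client from Step 2; the `:fresh`/`:stale` key suffixes and TTL split are illustrative assumptions:

```typescript
// Serve a stale value immediately while refreshing it in the background
async function staleWhileRevalidate(
  key: string,
  fetcher: () => Promise<string>,
  freshTtlSeconds = 300,
  staleTtlSeconds = 3600
): Promise<string> {
  const fresh = await redis.get(`${key}:fresh`);
  if (fresh) return fresh;

  const stale = await redis.get(`${key}:stale`);
  if (stale) {
    // Refresh asynchronously; callers get the stale value with no added latency
    void fetcher()
      .then(async value => {
        await redis.setex(`${key}:fresh`, freshTtlSeconds, value);
        await redis.setex(`${key}:stale`, staleTtlSeconds, value);
      })
      .catch(() => { /* keep serving stale if the refresh fails */ });
    return stale;
  }

  // Cold cache: fetch synchronously and populate both keys
  const value = await fetcher();
  await redis.setex(`${key}:fresh`, freshTtlSeconds, value);
  await redis.setex(`${key}:stale`, staleTtlSeconds, value);
  return value;
}
```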
## Examples
### Quick Performance Wrapper
```typescript
const withPerformance = async <T>(
  name: string,
  fn: () => Promise<T>
): Promise<T> => {
  const start = Date.now();
  const result = await fn();
  console.log(`[${name}] ${Date.now() - start}ms`);
  return result;
};

// Usage
const response = await withPerformance('chat', () =>
  client.chat.complete({ model, messages })
);
```
### Parallel Requests with Concurrency Limit
```typescript
import pLimit from 'p-limit';

const limit = pLimit(5); // Max 5 concurrent requests

// `prompts` is an array of user prompt strings
const results = await Promise.all(
  prompts.map(prompt =>
    limit(() => client.chat.complete({
      model: 'mistral-small-latest',
      messages: [{ role: 'user', content: prompt }],
    }))
  )
);
```
## Resources
- [Mistral AI Models](https://docs.mistral.ai/getting-started/models/)
- [LRU Cache Documentation](https://github.com/isaacs/node-lru-cache)
- [DataLoader Documentation](https://github.com/graphql/dataloader)
## Next Steps
For cost optimization, see `mistral-cost-tuning`.