I’m working on an intent classification system for a conversational AI platform at Rizz AI lovers, built on the WordPress REST API and vanilla JavaScript, and I’m running into critical bottlenecks with the OpenAI API integration.
Core Issues:
Multi-Intent Classification: Complex queries like “Hey, how do you do?” yield inconsistent JSON responses, with ~15% classification errors on GPT-3.5-turbo. Is function calling a better fit here than structured prompts?
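For concreteness, here is a minimal sketch of the function-calling variant. It builds a Chat Completions request body that forces a tool call, then validates the returned arguments instead of trusting free-form JSON. The intent labels, the `report_intents` function name, and the system message are placeholders for illustration, not my production values.

```javascript
// Hypothetical intent labels; swap in the real taxonomy.
const INTENTS = ["greeting", "smalltalk", "question", "other"];

// Build a Chat Completions request body that forces a call to one tool,
// so the model answers with arguments shaped by a JSON schema.
function buildClassifyRequest(userText) {
  return {
    model: "gpt-3.5-turbo",
    temperature: 0.1,
    messages: [
      { role: "system", content: "Classify every intent in the user message." },
      { role: "user", content: userText },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "report_intents", // hypothetical name
          description: "Report all intents found in the message.",
          parameters: {
            type: "object",
            properties: {
              intents: {
                type: "array",
                items: { type: "string", enum: INTENTS },
              },
            },
            required: ["intents"],
          },
        },
      },
    ],
    tool_choice: { type: "function", function: { name: "report_intents" } },
  };
}

// Parse and validate the tool-call arguments from a response object.
// Returns an array of known intents, or null if anything is off.
function parseIntents(response) {
  try {
    const call = response.choices[0].message.tool_calls[0];
    const args = JSON.parse(call.function.arguments);
    if (!Array.isArray(args.intents) || args.intents.length === 0) return null;
    return args.intents.every((i) => INTENTS.includes(i)) ? args.intents : null;
  } catch {
    return null; // missing fields or malformed JSON
  }
}
```

A `null` result can then be retried or routed to a default intent, which at least makes the failure mode explicit rather than a silent misclassification.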
Context Management: Conversations of 8+ turns hit the 4k token limit. My current sliding window (last 3 exchanges) drops critical context, which hurts accuracy. Are vector embeddings for context retrieval worth exploring?
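To illustrate the retrieval idea, here is a rough sketch that keeps the most recent exchanges and adds the older ones most similar to the current query by cosine similarity. It assumes each stored turn already carries an `embedding` array (for example from an embeddings endpoint); the `selectContext` name and the default counts are made up for this example.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Keep the last `recent` turns, plus the `topK` older turns whose
// embeddings are closest to the query. Output stays in chronological order.
function selectContext(turns, queryEmbedding, recent = 3, topK = 2) {
  const cut = Math.max(0, turns.length - recent);
  const older = turns.slice(0, cut);
  const latest = turns.slice(cut);
  const retrieved = older
    .map((turn, index) => ({ turn, index, score: cosine(turn.embedding, queryEmbedding) }))
    .sort((x, y) => y.score - x.score) // best matches first
    .slice(0, topK)
    .sort((x, y) => x.index - y.index) // back to chronological order
    .map((x) => x.turn);
  return [...retrieved, ...latest];
}
```

The token budget is then bounded by `recent + topK` turns regardless of conversation length, at the cost of one embedding lookup per message.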
Latency Problem: The OpenAI API averages 800ms, but I need <200ms for real-time chat. Local models (DistilBERT) are fast enough, but accuracy drops from 94% to 78%.
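One hybrid pattern I am considering is a confidence gate: answer from the local model when it is confident, and only pay the remote latency otherwise. The sketch below assumes both classifiers are injected as async functions returning `{ intent, confidence }`. The function names and the 0.85 threshold are placeholders, not tuned values.

```javascript
// Try the fast local classifier first; escalate to the remote one only
// when local confidence is below the threshold. If the remote call fails,
// fall back to the local answer instead of failing the chat turn.
async function classifyHybrid(text, localClassify, remoteClassify, threshold = 0.85) {
  const local = await localClassify(text);
  if (local.confidence >= threshold) {
    return { ...local, source: "local" };
  }
  try {
    const remote = await remoteClassify(text);
    return { ...remote, source: "remote" };
  } catch (err) {
    return { ...local, source: "local-fallback" };
  }
}
```

Whether this helps depends on how often the local model clears the threshold, and on how well its confidence scores are calibrated, which I have not measured yet.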
Cost Scaling: At 50k tokens/day, I’m projecting $150+ per month just for intent classification. Redis caching only achieves a 40% hit rate.
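One idea for raising the hit rate is to normalize messages before using them as cache keys, on the assumption (not yet verified against my logs) that many misses are trivially different spellings of the same message. A minimal sketch, with an in-memory `Map` standing in for Redis and a synchronous `classify` callback standing in for the real async call:

```javascript
// Collapse case, punctuation, and whitespace so that near-identical
// messages map to the same cache key.
function normalizeForCache(text) {
  return text
    .normalize("NFKC")
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ") // drop punctuation and symbols
    .replace(/\s+/g, " ")
    .trim();
}

// Wrap a classifier with a cache keyed on the normalized text.
// `cache` only needs has/get/set, so a Map works as a stand-in.
function makeCachedClassifier(classify, cache = new Map()) {
  return function cachedClassify(text) {
    const key = normalizeForCache(text);
    if (cache.has(key)) {
      return { intent: cache.get(key), cached: true };
    }
    const intent = classify(text);
    cache.set(key, intent);
    return { intent, cached: false };
  };
}
```

The tradeoff is that stripping punctuation can merge messages whose intent differs (a question versus an exclamation), so the normalization rules would need checking against real traffic.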
Technical Constraints: WordPress plugin async processing limitations, JavaScript promise chain complexity, session persistence across page reloads.
What I’ve Tried: Temperature tuning (0.1-0.3), system message optimization, prompt chaining, fine-tuning on 500+ examples.
Seeking: Hybrid architecture patterns (local + OpenAI), production benchmarks for OpenAI vs. local models, cost-effective scaling strategies, real-world error handling approaches.
Has anyone solved similar challenges in production conversational AI? I’m particularly interested in WordPress-based implementations and latency optimization techniques.