Zing: A Production Cross-Platform Soft Phone
How we built a Twilio-powered VoIP and SMS soft phone for web, iOS, and Android — from a single React codebase — with enterprise-grade call management, conferencing, and AI-assisted routing.
The Challenge: Enterprise Communication Without Enterprise Complexity
Enterprise contact centers and field teams face a common problem: they need professional-grade voice and SMS capabilities accessible from any device — desktop browser, iOS, Android — but traditional VoIP hardware is expensive, inflexible, and poor at integrating with modern AI workflows. Cloud soft phones exist, but most are either overly generic (built for consumer use) or prohibitively expensive for mid-market deployment.
We built Zing to solve this for our own operational needs and to demonstrate that a production-grade, multi-platform communications platform is achievable in a single engineering iteration — without compromise on reliability, feature depth, or the integration surface needed to connect it to AI-assisted workflows.
The Architecture: One Codebase, Three Platforms
The core engineering decision was to build on React with Capacitor, targeting web, iOS, and Android from a single codebase. This approach has important trade-offs: you gain development velocity and feature parity across platforms, but you must manage platform-specific audio routing, push notification behavior, and native SDK integration carefully or the mobile experience degrades relative to what users expect from a purpose-built native app.
We solved the platform-specific challenges through custom Capacitor plugins. The TwilioVoice plugin bridges the Twilio Voice native SDK to the React frontend, handling WebRTC negotiation and CallKit integration on iOS (for native audio routing and lock-screen call controls); a separate AudioRoute plugin manages audio routing on Android. Together, these plugins ensure that the calling experience, particularly audio quality, backgrounding behavior, and call notifications, meets native-app standards rather than browser-tab standards.
Technical Stack
Frontend and Mobile
- React 18 + Capacitor 6.2
- Twilio Voice SDK v2.11
- Custom TwilioVoice plugin (iOS/Android)
- Custom AudioRoute plugin (Android)
- WebRTC via Twilio Device API
Backend
- Flask 3.1 (Python)
- JWT authentication
- MySQL + SQLAlchemy 2.0
- Twilio REST API (voice + SMS)
- TwiML webhook handlers
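The TwiML webhook handlers listed above answer Twilio's HTTP requests with XML instructions. As a minimal sketch, assuming an inbound-call webhook that simply rings one soft-phone user, the response can be built with Python's standard library. The `inbound_call_twiml` helper and the `agent_alice` identity are hypothetical; `<Response>`, `<Dial>` (with its `timeout` attribute) and `<Client>` are standard TwiML. In a Flask view the string would be returned with a `text/xml` content type, and the `twilio` helper library's `VoiceResponse` class produces equivalent markup.

```python
import xml.etree.ElementTree as ET


def inbound_call_twiml(client_identity: str, timeout_s: int = 20) -> str:
    """Build TwiML that rings a Voice SDK (soft phone) identity.

    Hypothetical helper: `client_identity` must match the identity embedded
    in the access token the soft phone registered with.
    """
    response = ET.Element("Response")
    # <Dial> bridges the inbound caller to another party; `timeout` is how
    # many seconds the callee may ring before Twilio gives up.
    dial = ET.SubElement(response, "Dial", timeout=str(timeout_s))
    # <Client> targets a Voice SDK endpoint instead of a phone number.
    client = ET.SubElement(dial, "Client")
    client.text = client_identity  # ElementTree escapes &, <, > for us
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(inbound_call_twiml("agent_alice"))
    # prints: <Response><Dial timeout="20"><Client>agent_alice</Client></Dial></Response>
```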
Feature Scope: What Zing Delivers
The feature set covers the full lifecycle of enterprise voice and messaging:
Voice Calls
- Outbound via dialpad
- Inbound accept/decline
- Mute, DTMF, hold
- Conference calls
- Warm transfer
- CallKit (iOS lock screen)
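Conference calls and warm transfer are commonly built on Twilio's `<Conference>` noun: every leg of the call joins a named room, so the transferring agent can introduce the caller to a colleague and then leave without ending the call. The source does not describe Zing's exact transfer flow, so the following is a minimal sketch assuming that conference-based approach. The room name, the three role names, and `conference_twiml` are hypothetical; `startConferenceOnEnter` and `endConferenceOnExit` are real `<Conference>` attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-role flags for a warm transfer:
# - the customer waits in the room until an agent joins and never ends it;
# - the original agent starts the room but can drop off without ending it;
# - the receiving agent's hang-up ends the conference for everyone.
ROLE_FLAGS = {
    "customer": {"startConferenceOnEnter": "false", "endConferenceOnExit": "false"},
    "original_agent": {"startConferenceOnEnter": "true", "endConferenceOnExit": "false"},
    "receiving_agent": {"startConferenceOnEnter": "true", "endConferenceOnExit": "true"},
}


def conference_twiml(room: str, role: str) -> str:
    """Return TwiML that places one call leg into the named conference room."""
    flags = ROLE_FLAGS[role]  # unknown roles raise KeyError on purpose
    response = ET.Element("Response")
    dial = ET.SubElement(response, "Dial")
    conference = ET.SubElement(dial, "Conference", attrib=dict(flags))
    conference.text = room
    return ET.tostring(response, encoding="unicode")
```

The transfer itself is then a sequence of webhook responses: the customer and original agent join the room, the receiving agent is dialed into the same room, and the original agent hangs up.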
Messaging
- SMS send/receive
- Conversation view by contact
- Message history persistence
- Real-time delivery of incoming messages via webhook
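A conversation view keyed by contact needs every stored message, inbound or outbound, filed under the other party's number. A minimal sketch, assuming each message is stored with sender, recipient, and timestamp, and that phone numbers are already normalized to E.164 strings. The `Message` dataclass and `group_by_contact` helper are hypothetical illustrations, not Zing's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Message:
    """Hypothetical stored SMS record; numbers assumed to be E.164 strings."""
    from_number: str
    to_number: str
    body: str
    sent_at: datetime


def group_by_contact(messages: list[Message], own_numbers: set[str]) -> dict[str, list[Message]]:
    """File each message under the counterparty's number, oldest first.

    `own_numbers` holds the Twilio numbers the soft phone sends from; the
    counterparty is whichever side of the message is not one of them.
    """
    threads: dict[str, list[Message]] = defaultdict(list)
    for msg in messages:
        outbound = msg.from_number in own_numbers
        contact = msg.to_number if outbound else msg.from_number
        threads[contact].append(msg)
    for thread in threads.values():
        thread.sort(key=lambda m: m.sent_at)  # chronological order for display
    return dict(threads)
```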
Contact Management
- Add/edit/delete contacts
- Call history per contact
- Click-to-call from contact
- SMS from contact
Operations
- Full call log with duration
- Participant roster for conferences
- JWT-based authentication
- Token refresh without interruption
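Refreshing a token without interrupting a live call usually means fetching the replacement shortly before the old token expires, rather than after a request fails. A minimal refresh-ahead sketch, assuming the backend returns a token together with its lifetime in seconds; `TokenCache`, the fetcher signature, and the 60-second margin are hypothetical. On the web client, the Twilio Voice JavaScript SDK's `Device` also emits a `tokenWillExpire` event and accepts a new token through `updateToken()`, which serves the same purpose for the voice token specifically.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical fetcher: returns (token, lifetime_in_seconds) from the backend.
TokenFetcher = Callable[[], Tuple[str, float]]


class TokenCache:
    """Refresh-ahead cache: renew the token `margin_s` seconds before expiry."""

    def __init__(self, fetch_token: TokenFetcher, margin_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_token
        self._margin = margin_s
        self._clock = clock  # injectable so tests can control time
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one once it nears expiry."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            token, lifetime_s = self._fetch()
            self._token = token
            self._expires_at = now + lifetime_s
        return self._token
```

Because callers always go through `get()`, an in-progress call keeps using a valid token and the swap happens between requests instead of mid-call.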
The Engineering Challenge: Real-Time Audio Across Three Platforms
The hardest part of building a cross-platform soft phone isn't the call initiation or the UI — it's ensuring audio quality and call state management are consistent and reliable across the three very different runtime environments (web browser, iOS UIKit, Android Activity) that Capacitor targets.
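One way to keep call state consistent across the three runtimes is to have each platform layer (the Twilio `Device` on web, the native plugins on iOS and Android) report into a single state model with an explicit transition table, so the React UI never sees an impossible sequence. A minimal sketch, assuming a simplified set of states; the state names and the `CallStateMachine` class are hypothetical, not Zing's actual model.

```python
from enum import Enum


class CallState(Enum):
    IDLE = "idle"
    RINGING = "ringing"          # inbound call offered to the user
    CONNECTING = "connecting"    # outbound dial or accepted inbound in progress
    CONNECTED = "connected"
    ON_HOLD = "on_hold"
    ENDED = "ended"


# Allowed transitions; anything else indicates a bug in a platform layer.
TRANSITIONS: dict[CallState, set[CallState]] = {
    CallState.IDLE: {CallState.RINGING, CallState.CONNECTING},
    CallState.RINGING: {CallState.CONNECTING, CallState.ENDED},  # accept or decline
    CallState.CONNECTING: {CallState.CONNECTED, CallState.ENDED},
    CallState.CONNECTED: {CallState.ON_HOLD, CallState.ENDED},
    CallState.ON_HOLD: {CallState.CONNECTED, CallState.ENDED},
    CallState.ENDED: set(),
}


class CallStateMachine:
    def __init__(self) -> None:
        self.state = CallState.IDLE
        self.history: list[CallState] = [self.state]

    def transition(self, new_state: CallState) -> bool:
        """Apply a platform event; return False and keep state if it is illegal."""
        if new_state == self.state:
            return True  # duplicate events are harmless no-ops
        if new_state not in TRANSITIONS[self.state]:
            return False
        self.state = new_state
        self.history.append(new_state)
        return True
```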
On iOS, calls must integrate with CallKit to appear on the lock screen and route audio through the earpiece correctly when the phone is against the user's ear. Without this integration, the app behaves like a browser tab making a phone call, which is not what enterprise users expect. We wrote a custom Capacitor plugin that bridges CallKit to the React frontend, handles audio session activation/deactivation around calls, and manages the interaction with Bluetooth headsets, including AirPods.
On Android, audio routing behaves differently: the AudioManager API must be explicitly configured for each call state (in-call mode, speakerphone, Bluetooth routing). A second custom plugin handles this, ensuring audio routes correctly whether the user is on a handset, a Bluetooth headset, or a speaker — without requiring the user to manually switch audio outputs.
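Choosing an output automatically as devices connect and disconnect reduces to a priority decision that the native plugin re-evaluates on every device change. The sketch below expresses only that decision logic, written in Python for readability rather than the Kotlin or Java a Capacitor Android plugin would actually use. The route names, the priority order, and `select_route` are assumptions, and the code does not call Android's `AudioManager`.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    BLUETOOTH = "bluetooth"
    WIRED_HEADSET = "wired_headset"
    EARPIECE = "earpiece"
    SPEAKER = "speaker"


# Assumed automatic preference: a connected headset beats the built-in earpiece.
AUTO_PRIORITY = [Route.BLUETOOTH, Route.WIRED_HEADSET, Route.EARPIECE]


def select_route(available: set[Route], user_choice: Optional[Route] = None) -> Route:
    """Pick the audio output for the active call.

    An explicit user choice (e.g. tapping "speaker") wins while that route is
    still available; otherwise fall back to the automatic priority list.
    """
    if user_choice is not None and user_choice in available:
        return user_choice
    for route in AUTO_PRIORITY:
        if route in available:
            return route
    return Route.SPEAKER  # e.g. a tablet with no earpiece
```

Re-running the function from each connect/disconnect callback is what removes manual switching: if a Bluetooth headset drops mid-call, the next evaluation falls back to the earpiece.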
On web, the Twilio Voice SDK handles WebRTC natively, but browser audio permissions, echo cancellation, and microphone selection required careful handling — particularly for enterprise use cases where users may have multiple audio devices connected simultaneously.
Why This Case Study Matters for Enterprise AI Buyers
Zing exists because we needed it — and because building it demonstrates something important about how SolvIT AI approaches product development: we build to production standards, not proof-of-concept standards. The same engineering discipline that produced Zing — real-time WebRTC audio, native plugin development, JWT auth, production database modeling, webhook handling at scale — is what we bring to enterprise client engagements.
When a prospective client asks "can you build [complex, multi-platform, real-time system]?" — Zing is the answer. Not a slide deck about our capabilities, but a working, deployed application that uses the same technology stack and the same architectural rigor we apply to client work.
For organizations considering AI-powered communication workflows (intelligent call routing, automatic call summarization, real-time sentiment analysis for customer service teams, or AI-assisted dispatch), Zing's architecture is the foundation those workflows are built on. Our agentic automation layer connects directly to the communications pipeline, enabling workflows like: AI classifies incoming call intent, routes to the right agent, generates a post-call summary, and triggers the appropriate follow-up workflow automatically.
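The first two steps of that workflow, classifying intent and routing, can be sketched as a small function sitting in front of the inbound-call webhook. This is a minimal sketch assuming the caller's opening words are already available as text (for example from a speech-recognition step), with a keyword matcher standing in for a model-based classifier. The intents, keywords, queue names, and functions are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical intents and trigger phrases. A production system would call a
# language model or trained classifier here instead of matching keywords.
INTENT_KEYWORDS: dict[str, tuple[str, ...]] = {
    "billing": ("invoice", "charge", "refund", "payment"),
    "support": ("broken", "error", "not working", "outage"),
    "sales": ("pricing", "quote", "demo", "upgrade"),
}

# Hypothetical mapping from intent to the agent queue that should answer.
INTENT_QUEUE = {"billing": "finance-team", "support": "tier1-support", "sales": "sales-desk"}
DEFAULT_QUEUE = "front-desk"


@dataclass
class RoutingDecision:
    intent: str
    queue: str


def classify_intent(transcript: str) -> str:
    """Return the intent whose keywords occur most often, or 'unknown'."""
    text = transcript.lower()
    scores = {
        intent: sum(text.count(kw) for kw in keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def route_call(transcript: str) -> RoutingDecision:
    """Classify the caller's request and pick the queue that should take it."""
    intent = classify_intent(transcript)
    return RoutingDecision(intent, INTENT_QUEUE.get(intent, DEFAULT_QUEUE))
```

Post-call summarization and follow-up would attach at the other end of the call, for instance in a handler for Twilio's call status callback once the call completes.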
Building a communications platform or need AI integrated into your contact center?
We've done it. Let's talk about applying the same architecture to your environment.
Schedule a Technical Consultation