WebRTC — Basic concepts and creating a simple video call app
In this post, we will learn about WebRTC and how we can use it to set up a simple peer-to-peer video calling app.
This post has two sections. First, I will explain in simple terms how WebRTC works and cover some of the key terminology.
In the second section, we will create a simple video chat application using JavaScript and the Phoenix framework (optional) to see it all in action.
So let's jump right in…
SECTION I — WebRTC Concepts
What is WebRTC?
WebRTC stands for Web Real-Time Communications. It is very popular these days, and there is a high chance that any video communication app you use daily, like Google Meet, is using WebRTC under the hood for real-time video communication.
WebRTC enables sending voice, video, and any arbitrary data across browsers in real-time in a peer-to-peer fashion.
Unlike the usual client-server paradigm, peer-to-peer (P2P) is a model where two clients communicate directly with each other.
The only standardized means for doing that across web browsers is by using WebRTC. P2P reduces the load on servers, reduces latency of messages, and increases privacy.
How does it work?
Imagine you want to connect to your friend on a video call that uses WebRTC. These are the high-level steps that would happen…
The Offer
First, your browser would create an offer. This offer results in the creation of an SDP (Session Description Protocol) object, which contains information like the video codec, timing, etc.
This offer would then somehow be sent to your friend (the remote peer), asking them to connect with you using WebRTC.
So, SDP is used by WebRTC to negotiate the session’s parameters.
The Answer
Now your friend (the remote peer) has to answer the SDP offer that they received.
To answer the call, your friend has to create an SDP answer and somehow send it back to you.
An SDP object looks somewhat like this…
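For a rough feel of the format, here is a shortened, made-up SDP blob together with a small helper that splits it into its `type=value` lines. The values are illustrative only and were not captured from a real call; the SDP that a browser actually generates is much longer. The helper names `parseSdpLines` and `listCodecs` are my own.

```javascript
// A shortened, made-up SDP blob (illustrative values only).
const exampleSdp = [
  "v=0",
  "o=- 4611731400430051336 2 IN IP4 127.0.0.1",
  "s=-",
  "t=0 0",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");

// Every SDP line has the form <one-letter type>=<value>.
function parseSdpLines(sdp) {
  return sdp
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .map((line) => ({ type: line[0], value: line.slice(2) }));
}

// List the codecs advertised through "a=rtpmap:" attribute lines.
function listCodecs(sdp) {
  return parseSdpLines(sdp)
    .filter((l) => l.type === "a" && l.value.startsWith("rtpmap:"))
    .map((l) => l.value.split(" ")[1]);
}

console.log(listCodecs(exampleSdp)); // [ 'VP8/90000' ]
```

In practice you rarely parse SDP by hand; the browser produces it and consumes it for you, and your code just passes it along.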
The Signaling server
The offer and answer data contains SDP objects that are used by WebRTC to negotiate the session’s parameters.
Notice how we talked about sending the offer to your friend and your friend sending the answer back to you. There has to be some way to exchange this data, and this is where a signaling server is useful.
A signaling server can be any 3rd party server. Its sole purpose is signaling, that is, facilitating the exchange of messages between the 2 peers (you and your friend).
Signaling servers often use WebSockets to exchange information between 2 peers, like offer data, answer data, and ICE candidates (more on this later).
Later in this blog, we will create our own signaling server and dive into the details, so stay tuned.
Thus, the signaling server (often implemented via WebSockets) allows 2 peers to exchange connection data in the form of SDP objects, but it never touches the media itself. The media is transmitted between the peers themselves via WebRTC.
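To make the idea concrete, here is a tiny in-memory stand-in for a signaling server. It is only a sketch: the `SignalingRelay` class and the message shapes are made up for illustration, and a real server would do the same forwarding over WebSockets. Note that it only passes small messages around and never sees audio or video.

```javascript
// In-memory stand-in for a signaling server (hypothetical, for illustration).
class SignalingRelay {
  constructor() {
    this.peers = new Map(); // peerId -> callback that delivers a message
  }

  join(peerId, onMessage) {
    this.peers.set(peerId, onMessage);
  }

  // Forward a message to everyone except the sender.
  send(fromPeerId, message) {
    for (const [peerId, deliver] of this.peers) {
      if (peerId !== fromPeerId) deliver({ from: fromPeerId, ...message });
    }
  }
}

const relay = new SignalingRelay();
const inbox = { you: [], friend: [] };
relay.join("you", (msg) => inbox.you.push(msg));
relay.join("friend", (msg) => inbox.friend.push(msg));

// The offer travels one way, the answer travels back.
relay.send("you", { type: "offer", sdp: "v=0 ..." });
relay.send("friend", { type: "answer", sdp: "v=0 ..." });

console.log(inbox.friend[0].type); // offer
console.log(inbox.you[0].type); // answer
```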
STUN servers and TURN servers
In order to exchange media between two peers via WebRTC, they need to know each other’s IP addresses.
However, in the real world, users/browsers/peers often sit behind firewalls, and their addresses are rewritten by NATs (Network Address Translation, which translates internal private IP addresses to external public IP addresses).
This makes peer-to-peer connections complicated. To solve this, we need to find the public-facing IP address of a peer, and this is where STUN servers are helpful.
STUN (Session Traversal Utilities for NAT)
To learn its public IP address, a WebRTC client sends a request to a STUN server.
The server then replies with the public IP address from which it saw the request arrive. This way, the WebRTC client learns what its public IP address is.
The WebRTC client then shares its public IP address with the remote peer.
In other words, you first send a request to a STUN server, which replies with your public-facing IP address. You can then send this IP address to your friend (the remote peer). Your friend must go through a similar process and send you their public-facing IP.
TURN (Traversal Using Relays around NAT)
STUN does not always work. With some network architectures and NAT device types, such as symmetric NATs, the public IP address obtained via STUN cannot be used to reach the peer.
This is why STUN is used in conjunction with TURN and ICE.
TURN servers are useful to relay media when the use of STUN isn’t possible. The decision of whether to use STUN or TURN is orchestrated by a protocol called ICE.
With a TURN server, we relay all the media through it. This can be expensive since it costs bandwidth and CPU on the server.
This is why, unlike STUN servers, which are often publicly available, TURN servers usually aren't publicly available and need to be installed and maintained separately (or you pay for a hosted service).
Thus a STUN server is used to get an external network address and TURN servers are used to relay traffic if the direct (peer-to-peer) connection fails.
Every TURN server supports STUN. A TURN server is a STUN server with additional built-in relaying functionality.
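In the browser, STUN and TURN servers are handed to WebRTC through the `iceServers` option of `RTCPeerConnection`. Below is a minimal sketch of that configuration. The helper name `buildRtcConfig` is my own, and the TURN URL and credentials are placeholders for a server you would have to run or rent yourself; only the Google STUN address is a real public server.

```javascript
// Build the configuration object passed to `new RTCPeerConnection(config)`.
// `turn` is optional; its url/username/credential are placeholders.
function buildRtcConfig(turn) {
  const iceServers = [
    // Public STUN server run by Google.
    { urls: "stun:stun.l.google.com:19302" },
  ];
  if (turn) {
    iceServers.push({
      urls: turn.url,
      username: turn.username,
      credential: turn.credential,
    });
  }
  return { iceServers };
}

const config = buildRtcConfig({
  url: "turn:turn.example.com:3478", // hypothetical TURN server
  username: "demo-user",
  credential: "demo-pass",
});

// Only create the connection where the browser API exists.
if (typeof RTCPeerConnection !== "undefined") {
  const pc = new RTCPeerConnection(config);
  console.log(pc.connectionState);
}
```

With only a STUN entry, calls between peers behind stricter NATs may simply fail to connect, which is fine for a demo but not for production.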
ICE (Interactive Connectivity Establishment)
ICE is a standard method of NAT traversal used in WebRTC. It works to punch open ports in the firewalls.
ICE deals with the process of connecting media through NATs by conducting connectivity checks.
Each address gathered this way, for example one received from the STUN server, is called an ICE candidate. (An ICE candidate contains a potential IP address and port pair that can be used to establish a peer-to-peer connection.)
ICE collects all available candidates (local IP addresses, reflexive addresses — STUN ones, and relayed addresses — TURN ones). All the collected addresses are then sent to the remote peer via SDP.
Once the WebRTC client has all the collected ICE addresses of itself and its peer, it starts a series of connectivity checks. These checks essentially try sending data over the various address pairs until one succeeds.
The algorithm then decides which ICE candidate is the best and will be used to transmit the real data.
Okay, now remember we talked about a signaling server. That server will also help us exchange the ICE candidates between the two peers (you and your friend).
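To see what a candidate actually carries, here is a small sketch that picks apart candidate strings. The three example lines are made up (using documentation-only IP ranges), and `parseCandidate` is my own helper, not a browser API. The `typ` field tells us where the address came from: `host` is a local address, `srflx` a reflexive one from STUN, and `relay` one from TURN.

```javascript
// Map the "typ" field of a candidate to where the address came from.
const CANDIDATE_SOURCES = {
  host: "local network interface",
  srflx: "public address reported by a STUN server",
  prflx: "address learned during connectivity checks",
  relay: "address allocated on a TURN server",
};

// Parse "candidate:<foundation> <component> <transport> <priority> <ip> <port> typ <type> ..."
function parseCandidate(candidateLine) {
  const parts = candidateLine.replace(/^candidate:/, "").split(" ");
  const [foundation, component, protocol, priority, ip, port, typKeyword, type] = parts;
  if (typKeyword !== "typ") throw new Error("not a candidate line");
  return {
    foundation,
    component: Number(component),
    protocol,
    priority: Number(priority),
    ip,
    port: Number(port),
    type,
    source: CANDIDATE_SOURCES[type],
  };
}

// Made-up candidates for illustration.
const examples = [
  "candidate:1 1 udp 2122260223 192.168.1.5 46154 typ host",
  "candidate:2 1 udp 1686052607 203.0.113.7 46154 typ srflx raddr 192.168.1.5 rport 46154",
  "candidate:3 1 udp 41885439 198.51.100.9 50000 typ relay raddr 203.0.113.7 rport 46154",
];

for (const line of examples) {
  const c = parseCandidate(line);
  console.log(`${c.type} ${c.ip}:${c.port} (${c.source})`);
}
```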
The Flow
Now let's recap all the things we learned so far and see how they work together.
So when two peers, peer-1 and peer-2, want to communicate with each other, first peer-1 sends an offer (SDP object) to peer-2 via the signaling server.
Peer-2 then accepts this offer and sends an answer (SDP object) back to peer-1 via the signaling server.
Now, peer-1 and peer-2 communicate with a STUN server to find their public-facing IPs. If this does not work due to firewalls or NAT-related issues, the 2 peers can instead relay media via a TURN server.
The peers now exchange their IP and port pairs called ICE candidates. The ICE candidates are exchanged via the signaling server.
At this point, the WebRTC session is connected and the peers can either exchange data directly (P2P) between themselves or relay it via a TURN server.
SECTION II — Coding a simple video chat app using WebRTC
Now that we have some idea of how video communication takes place using WebRTC, let's code a simple P2P video calling app which uses WebRTC.
In an actual production environment, it is often a good idea to choose a WebRTC provider like TokBox, since all the complexity of maintaining STUN or TURN servers is handled by them. They also provide easy-to-use SDKs for various languages, so you can add audio/video calling features without diving into the details of WebRTC.
However, in this blog, I will not be using any 3rd party services and will only use the WebRTC API that browsers come with. This code example is a modified form of the code from the awesome blog post by fireship.io, and I would definitely suggest giving it a read.
The original code shared by fireship.io uses Firebase as the signaling server, which is very easy and avoids having to create our own signaling server.
However, in this post, we will learn how we can create our own signaling server using the Phoenix framework.
If you are not familiar with Elixir and Phoenix, I would suggest going through the original blog post from fireship.io (linked in the resources at the end), which uses Firebase for signaling.
This is what the finished app would look like…
You need to connect to the WebRTC session from two different browsers, which would connect via WebRTC and share video and audio P2P.
The Code
Find the complete code example in this repo: https://github.com/Arp-G/simple-wrtc-demo
The client-side javascript code
First, set up a simple HTML page like…
The code is almost self-explanatory with lots of comments, so let's dive in…
Set up the Phoenix socket connection to be used for signaling, along with some helper functions. We are using the Phoenix JS library to connect to the Phoenix channels, which we will use later for signaling.
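As a rough sketch of this step (not the exact code from the repo), joining a channel with the Phoenix JS client could look like the following. The `/socket` path, the `call:<id>` topic name, and the helper names are assumptions of mine. The `Socket` class is passed in as a parameter so the helper can be tried out without a running server.

```javascript
// `SocketClass` is expected to be the `Socket` export of the phoenix package
// (import { Socket } from "phoenix"). Path and topic name are assumptions.
function joinCallChannel(SocketClass, callId, handlers = {}) {
  const socket = new SocketClass("/socket", { params: {} });
  socket.connect();

  // One channel topic per call, identified by the call id.
  const channel = socket.channel(`call:${callId}`, {});
  channel
    .join()
    .receive("ok", (reply) => handlers.onJoined && handlers.onJoined(reply))
    .receive("error", (reason) => handlers.onError && handlers.onError(reason));
  return channel;
}

// Small helper: push a signaling message to the channel.
function sendSignal(channel, event, payload) {
  return channel.push(event, payload);
}
```

In the browser this would be called as `joinCallChannel(Socket, callId)` after importing `Socket` from the phoenix package.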
Set up some global variables and an RTCPeerConnection object.
Set up the media sources by requesting media from the user's webcam. This can be easily achieved with the getUserMedia function that browsers provide.
After we get the user's webcam stream, we attach it to a video element in our HTML page. We also add the tracks of the media stream to the RTCPeerConnection object, since media from these tracks has to be sent to the remote peer.
We also set up an event listener “ontrack” on the RTCPeerConnection object to listen for new tracks from the remote peer, once a new track from the remote peer is available we attach it to a video element “remoteVideo” on the HTML page.
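A minimal sketch of the media setup described above could look like this. It is not the exact code from the repo: the function names and the `localVideo` element id are my own assumptions, while `remoteVideo` is the element mentioned above. The browser-only part is kept in `setupMedia`, which needs a page with both video elements.

```javascript
// Send every local track (camera + microphone) to the remote peer.
function addLocalTracks(pc, localStream) {
  localStream.getTracks().forEach((track) => pc.addTrack(track, localStream));
}

// Show incoming remote media in a <video> element.
function showRemoteTracks(pc, remoteVideo) {
  pc.ontrack = (event) => {
    remoteVideo.srcObject = event.streams[0];
  };
}

// Browser-only wiring: ask for webcam and microphone, then hook everything up.
async function setupMedia(pc) {
  const localStream = await navigator.mediaDevices.getUserMedia({
    video: true,
    audio: true,
  });
  document.getElementById("localVideo").srcObject = localStream; // assumed element id
  addLocalTracks(pc, localStream);
  showRemoteTracks(pc, document.getElementById("remoteVideo"));
  return localStream;
}
```

The browser will prompt the user for camera and microphone permission when `getUserMedia` is called, and the returned promise rejects if the user declines.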
The Local Peer (The browser which initiates the WebRTC call)
Now we will write the code to initiate the WebRTC call from the local peer.
Here we first connect to the signaling server by joining the phoenix channel topic using a unique call-id that we generated.
Next, we set up an event listener “onicecandidate” on the RTCPeerConnection object. When ICE candidates from the local peer become available, they are sent to the signaling server, which forwards them to the remote peer once it connects.
Now we create an offer from the local peer and send it to the remote peer via our signaling server.
We also set up an event listener on the signaling server to listen for the answer from the remote peer and also to listen for ICE candidates from the remote peer. Once we obtain these data we add them to our RTCPeerConnection object.
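Put together, the caller side can be sketched roughly as follows. This is a simplified illustration, not the repo's exact code: the event names `offer`, `answer`, and `ice_candidate` and the payload shapes are assumptions, `pc` is an RTCPeerConnection, and `channel` is an already joined Phoenix channel.

```javascript
// Caller side. Event names and payload shapes are assumptions.
async function startCall(pc, channel) {
  // Forward our ICE candidates as the browser discovers them.
  pc.onicecandidate = (event) => {
    if (event.candidate) {
      channel.push("ice_candidate", { candidate: event.candidate.toJSON() });
    }
  };

  // Apply the answer and the remote ICE candidates when they arrive.
  channel.on("answer", ({ answer }) => pc.setRemoteDescription(answer));
  channel.on("ice_candidate", ({ candidate }) => pc.addIceCandidate(candidate));

  // Create the offer, store it as our local description, then send it out.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  channel.push("offer", { offer: { type: offer.type, sdp: offer.sdp } });
}
```

The last `onicecandidate` event carries a null candidate to signal that gathering has finished, which is why the handler checks `event.candidate` before pushing.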
The Remote peer (The browser which answers the WebRTC call)
Now we write the code that will be executed on the remote peer. We first connect to our signaling server. When joining the channel of the signaling server, we receive any ICE candidates that the caller might have sent.
We then request the offer data from the signaling server and upon receiving the offer we add it as the remote description on our RTCPeerConnection object.
We also send our ICE candidates whenever they are available to the caller via the signaling server.
Now that we have the offer from the caller we prepare an answer to that offer and send it to the caller peer via the signaling server.
We also add any ICE candidates we have received from the caller to our RTCPeerConnection object.
Finally, we set up an additional event listener on the signaling server to add any ICE candidates we might later receive from the caller.
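The answering side can be sketched in the same way. Again this is a simplified illustration under assumptions: the event names and payload shapes are mine, and `offer` and `earlyCandidates` stand for whatever the signaling server stored from the caller before we joined.

```javascript
// Callee side. `offer` and `earlyCandidates` come from the signaling server;
// event names and payload shapes are assumptions.
async function answerCall(pc, channel, offer, earlyCandidates = []) {
  // Send our own ICE candidates to the caller as they become available.
  pc.onicecandidate = (event) => {
    if (event.candidate) {
      channel.push("ice_candidate", { candidate: event.candidate.toJSON() });
    }
  };

  // The remote description must be set before candidates can be added.
  await pc.setRemoteDescription(offer);

  // Prepare the answer, store it locally, then send it to the caller.
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  channel.push("answer", { answer: { type: answer.type, sdp: answer.sdp } });

  // Candidates the caller sent before we joined.
  for (const candidate of earlyCandidates) {
    await pc.addIceCandidate(candidate);
  }

  // Candidates the caller discovers later.
  channel.on("ice_candidate", ({ candidate }) => pc.addIceCandidate(candidate));
}
```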
And that's all the javascript we need to set up the WebRTC video communication between two browsers.
The signaling server
Now we will set up a simple signaling server with the help of the awesome Phoenix framework and Phoenix channels. (If you are not comfortable with Phoenix, I would suggest going through the fireship.io blog post, which uses Firebase for signaling.)
In the signaling server, we have to store the offer details and the ICE candidates from the local peer until the remote peer joins the call. For this purpose, we use a simple Elixir GenServer and store the information in a map as the GenServer state.
First, let's create a simple Elixir struct that will be used to store the offer details and ICE candidates for every call.
https://github.com/Arp-G/simple-wrtc-demo/blob/master/lib/simple_wrtc_demo/call.ex
Next, let's create a GenServer with some public APIs that can be used to store the data in memory.
https://github.com/Arp-G/simple-wrtc-demo/blob/master/lib/simple_wrtc_demo/calls_store.ex
You can add a supervisor so that your GenServer is restarted in case it crashes for some reason. You can then add this supervisor to your main Application so that it gets started automatically whenever you start the server.
https://github.com/Arp-G/simple-wrtc-demo/blob/master/lib/simple_wrtc_demo/store_supervisor.ex
Now make sure you add “SimpleWrtcDemo.StoreSupervisor” as a child in your SimpleWrtcDemo.Application module.
Finally, we can now create the call channel which will receive and broadcast messages between the two peers connected to the call.
The code should be mostly self-explanatory with the comments.
Notice how we use broadcast_from/3 to avoid sending messages to the peer who sent them.
That's all folks! That's all we need for our signaling server to work.
Group Video calls?
Finally, you might be wondering how we can do group video calls using WebRTC, the way Google Meet does, for example.
With WebRTC, we can do group video calls by having each participant connect to every other participant as a peer, creating a mesh network.
However, only a restricted number of participants (roughly 4 to 6) can connect with each other this way. Since each participant sends media to every other participant, each one needs N-1 uplinks and N-1 downlinks, where N is the total number of peers in the group video call.
In such cases, we can use a central server that relays the data to every other peer, so each participant has only one uplink and one downlink.
However, if the central server also decodes and re-encodes each participant's media, it requires high processing power.
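To get a feel for why a mesh stops scaling, here is a quick back-of-the-envelope calculation. The function names are my own, and the numbers only count links, not actual bandwidth.

```javascript
// Link counts for a full mesh of `n` participants.
function meshCost(n) {
  return {
    uplinksPerPeer: n - 1,
    downlinksPerPeer: n - 1,
    totalConnections: (n * (n - 1)) / 2, // one connection per pair of peers
  };
}

// Link counts when a central server relays the media instead.
function relayedCost(n) {
  return {
    uplinksPerPeer: 1,
    serverConnections: n, // the server holds one connection per participant
  };
}

for (const n of [2, 4, 6, 10]) {
  const mesh = meshCost(n);
  console.log(
    `${n} peers: ${mesh.uplinksPerPeer} uplinks each, ${mesh.totalConnections} connections in the mesh, ` +
      `${relayedCost(n).uplinksPerPeer} uplink each via a server`
  );
}
```

With 6 participants, each browser already encodes and uploads its video 5 times, which is roughly where home connections and laptops start to struggle.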
Conclusion
So today we learned about the basic concepts of WebRTC and saw how we can use the browser's WebRTC API to easily create a peer-to-peer video call app.
We also coded our own signaling server using the Phoenix framework to exchange data between the 2 peers to initiate the WebRTC call.
The code examples given in this blog may not be suitable for production usage and I would suggest using a 3rd party WebRTC provider instead for production.
Lastly, I am leaving some of the awesome resources that I found without which this blog would not have been possible.
- Glossary of WebRTC terminologies: https://webrtcglossary.com/
- Code with an explanation from fireship.io: https://fireship.io/lessons/webrtc-firebase-video-chat/
- Stack Overflow explanations: https://stackoverflow.com/questions/12708252/how-does-webrtc-work
That's all for today. Thanks for reading till the end, and have a good day!