MediaAPI
This document details a completely new media API for Pion WebRTC. The current media API has deficiencies that prevent it from being used in a few production workloads. This document doesn't aim to modify/extend the existing API; we are looking at it with fresh eyes.
I encourage everyone to comment on this page! When adding comments, add them in italics and include your GitHub username, e.g. *I believe this API can be improved by doing X -- Sean-Der*
If you can think of more use cases, please provide them; this list is not exhaustive!
A user has an audio/video file on disk and wants to send the content to many viewers. There will be no congestion control, but there will be some loss handling (NACK). If the remote viewer doesn't support the codec we offer, handshaking will fail.
A user has an existing RTP feed (an RTSP camera) and wants to send the content to many viewers. There will be no congestion control, but there will be some loss handling (NACK). If the remote viewer doesn't support the codec we offer, handshaking will fail.
A user will be encoding content and sending it to many viewers; this could be an MCU, or capturing a webcam or desktop (like github.com/nerdism/neko). There will be congestion control and packet loss handling (NACK/PLI). The user should be informed of the codecs the remote supports, and then be able to generate what is requested on the fly.
A user wants to save media from a remote peer to disk. This could be for playback later, or some other async task. We need to ensure the best experience possible by providing loss handling, and congestion control. Latency doesn't matter as much.
A user wants to consume media from a remote peer live. This could be used for processing (like GoCV) or live playback. We need to ensure the best experience possible by providing loss handling and congestion control. We will also need to be careful not to add much latency, as this could hurt the entire experience.
Users should be able to build the classical SFU use cases. For each peer you will have one PeerConnection, and transfer all tracks across it. If possible we should support Simulcast and SVC. However, if neither is supported we should just request the lowest bitrate that works for all peers. Beyond that we should pass everything through and let de-jittering happen on each receiver's side. This needs more research.
Users should be able to write idiomatic WebRTC code that works in both their native and Web applications. They should be able to call getUserMedia and have it work across both platforms. This portability is also very important for our ability to test.
An exact API will be defined below; this is a high-level overview of what the user interaction will look like.
A user on startup will declare what codecs they will support.
The user can add/remove from a list of RTCRtpCodecCapability
This allows us to express
- All codecs (H264, Opus, VPx)
- Attributes of that codec (packetization, profile)
- RTCPFeedback (NACK, REMB)
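As a sketch, the startup declaration and the codec intersection it enables might look like the following. The fields on RTCRtpCodecCapability and the intersectCodecs helper are illustrative assumptions, not the final Pion API:

```go
package main

import "fmt"

// RTCRtpCodecCapability is a simplified stand-in for the capability type
// described above; the real field set may differ.
type RTCRtpCodecCapability struct {
	MimeType     string   // e.g. "video/VP8", "audio/opus"
	ClockRate    uint32   // e.g. 90000 for video
	RTCPFeedback []string // e.g. "nack", "goog-remb"
}

// intersectCodecs returns the codecs both sides support, preserving the
// local preference order. This is the kind of intersection a user would
// see once signaling has finished.
func intersectCodecs(local, remote []RTCRtpCodecCapability) []RTCRtpCodecCapability {
	var common []RTCRtpCodecCapability
	for _, l := range local {
		for _, r := range remote {
			if l.MimeType == r.MimeType && l.ClockRate == r.ClockRate {
				common = append(common, l)
				break
			}
		}
	}
	return common
}

func main() {
	local := []RTCRtpCodecCapability{
		{MimeType: "video/H264", ClockRate: 90000, RTCPFeedback: []string{"nack"}},
		{MimeType: "video/VP8", ClockRate: 90000, RTCPFeedback: []string{"nack"}},
	}
	remote := []RTCRtpCodecCapability{
		{MimeType: "video/VP8", ClockRate: 90000},
	}
	fmt.Println(intersectCodecs(local, remote)) // only VP8 survives
}
```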
A user creates a MediaStreamTrack by either calling mediadevices.getUserMedia()
OR creating a Track via webrtc.NewTrack(kind RTCCodecType, id, label string, f func(RtpSender, supportedCodecs []RTCRtpCodecCapability) (RTCRtpCodecCapability, error))
Tracks must match MediaStreamTrack, so codec/ssrc will no longer be defined at the Track level.
No change from the current Pion API, peerConnection.AddTrack(track)
On SetRemoteDescription
a callback is fired on MediaStreamTrack with a RtpSender and supported codecs
Every time a PeerConnection that has added that track has finished signaling a callback is fired. Only then do we know the intersection of codecs. We can't pick H264 (or VPx) until we know the other side supports it.
func(sender RtpSender, supportedCodecs []RTCRtpCodecCapability) (RTCRtpCodecCapability, error) {
    if len(supportedCodecs) == 0 {
        return RTCRtpCodecCapability{}, fmt.Errorf("no supported codecs")
    }
    fanOutSlice = append(fanOutSlice, sender)
    return supportedCodecs[0], nil
}
The example above shows the typical fan-out case. We get a new RtpSender and add it to a list that another goroutine loops over and writes to. When one of the RTPSenders returns io.EOF, we remove it from the list. This is possible with the Pion API today, but here are the problems the new design solves.
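The fan-out write loop described above could be sketched as follows. RtpSender is stubbed down to the single hypothetical WriteRTP method the loop needs; the real interface will be much richer:

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// RtpSender is trimmed to what the fan-out loop needs.
type RtpSender interface {
	WriteRTP(payload []byte) error
}

// fanOut writes one packet to every sender and removes any sender whose
// write returned io.EOF (the remote hung up). It returns the survivors.
func fanOut(senders []RtpSender, payload []byte) []RtpSender {
	alive := make([]RtpSender, 0, len(senders))
	for _, s := range senders {
		if err := s.WriteRTP(payload); errors.Is(err, io.EOF) {
			continue // finished sender: drop it from the list
		}
		alive = append(alive, s)
	}
	return alive
}

// stubSender fails with io.EOF after `limit` successful writes.
type stubSender struct{ writes, limit int }

func (s *stubSender) WriteRTP([]byte) error {
	if s.writes >= s.limit {
		return io.EOF
	}
	s.writes++
	return nil
}

func main() {
	senders := []RtpSender{&stubSender{limit: 1}, &stubSender{limit: 5}}
	senders = fanOut(senders, []byte{0x80}) // both still alive
	senders = fanOut(senders, []byte{0x80}) // first returns io.EOF, dropped
	fmt.Println(len(senders))               // 1
}
```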
Juggling SSRC and PayloadType values makes the API hard to use. Browsers use different PayloadTypes, so this creates a lot of pain for users. It is also hard to debug when an SSRC is wrong.
You don't know if the remote supports H264/VP9/AV1. You now can pick which codec you prefer out of all the intersections.
The current API doesn't allow us to implement congestion control or error correction easily. By instead giving the user direct access to the RTPSender they have the hooks they need.
The user shouldn't need to do the math. Internally we should convert it to a sample rate and pass to pion/rtp
We will provide a sensible default, but these will both be interfaces that a user just has to satisfy. This is out of the scope of this document; the only thing we need to ensure is that it is possible without an API break.
A user can then go and interact with the JitterBuffer/CongestionController as they wish, if they want to mutate it at runtime or modify values. This will allow them to choose how much loss they are willing to tolerate, etc. This will also be helpful for building an SFU: you can have a CongestionController where you set the upper bound to the lowest of all receivers. The REMB is then constructed and sent back to the receiver.
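The SFU strategy described above boils down to taking the minimum of all receiver estimates. A toy sketch of that decision, with a hypothetical upperBound helper:

```go
package main

import "fmt"

// upperBound returns the bitrate cap for a fan-out sender: the lowest
// estimate reported by any receiver, per the SFU strategy described above.
// The function name and units (bits per second) are illustrative.
func upperBound(receiverEstimatesBps []uint64) uint64 {
	if len(receiverEstimatesBps) == 0 {
		return 0 // no receivers, nothing to cap
	}
	min := receiverEstimatesBps[0]
	for _, e := range receiverEstimatesBps[1:] {
		if e < min {
			min = e
		}
	}
	return min
}

func main() {
	// Three receivers reported these REMB estimates; the slowest one wins.
	fmt.Println(upperBound([]uint64{2_500_000, 800_000, 1_200_000})) // 800000
}
```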
We will put two callbacks on the RTPSender, and the user can ignore them if they wish. These aren't portable, but I think putting them in the SettingEngine is the wrong thing to do.
RtpSender.OnBitrateSuggestion(func(bitrate float64) {
})

RtpSender.OnKeyframeRequest(func() {
})
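A sketch of how a user might wire these two callbacks into their encoder. RTPSender here is a stub exposing only the callback registration described above, and encoder is a hypothetical stand-in for the user's own encoder:

```go
package main

import "fmt"

// RTPSender is a stub with just the two callbacks from the proposal.
type RTPSender struct {
	onBitrate  func(bitrate float64)
	onKeyframe func()
}

func (s *RTPSender) OnBitrateSuggestion(f func(float64)) { s.onBitrate = f }
func (s *RTPSender) OnKeyframeRequest(f func())          { s.onKeyframe = f }

// encoder is a hypothetical user encoder reacting to the feedback.
type encoder struct {
	targetBitrate float64
	forceKeyframe bool
}

func main() {
	enc := &encoder{targetBitrate: 1_000_000}
	sender := &RTPSender{}

	// The user reacts to congestion control and PLI/FIR feedback here.
	sender.OnBitrateSuggestion(func(bitrate float64) { enc.targetBitrate = bitrate })
	sender.OnKeyframeRequest(func() { enc.forceKeyframe = true })

	// Simulate the RTPSender firing the callbacks.
	sender.onBitrate(500_000)
	sender.onKeyframe()
	fmt.Println(enc.targetBitrate, enc.forceKeyframe) // 500000 true
}
```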
This will capture a video device and will work in both WASM and Go mode. When running in WASM mode the VP8 selection has no impact, though. If the WebRTC API allows it in the future, we will support it there too.
func main() {
    // We only want to send VP8
    s := webrtc.SettingEngine{
        Codecs: []RTCRtpCodecCapability{
            webrtc.RTCRtpCodecCapabilityDefaultVP8,
        },
    }
    api := webrtc.NewAPI(webrtc.WithSettingEngine(s))

    peerConnection, err := api.NewPeerConnection(webrtc.Configuration{})
    if err != nil {
        panic(err)
    }

    track, err := mediaDevices.GetUserMedia(mediaDevices.MediaStreamConstraints{Video: true})
    if err != nil {
        panic(err)
    }

    peerConnection.AddTrack(track)
}
I think we should allow users to encode their own video/audio, because the tracks that we receive from GetUserMedia should still be in raw format (we need to be able to transform the video/audio). The following shows the data flow starting from GetUserMedia and ending at the other peer.
Reference: https://w3c.github.io/mediacapture-main/#the-model-sources-sinks-constraints-and-settings
This diagram shows that the data from the source can be broadcast and transformed. Allowing users to encode their own video/audio also gives users some extra benefits:
- Fan-out video to many PeerConnection
- Use the source for other outputs, e.g. simply stream mjpeg through HTTP server
- Transform the source, the change will be reflected to all of the listeners
- Each listener has the option to transform the source without affecting other listeners
So, I propose that we should have a functional option to allow users to supply their own encoders.
type LocalTrack interface {
    ReadRTP() (*rtp.Packet, error)
    // The following methods allow PeerConnection to use RTCP feedback to automatically control the input.
    // SetBitRate sets the current target bitrate; a lower bitrate means less data will be transmitted,
    // but it also means the quality will be lower.
    SetBitRate(int) error
    // ForceKeyFrame forces the next frame to be a keyframe, aka intra-frame.
    ForceKeyFrame() error
}
type EncoderBuilder interface {
    Codec() webrtc.RTPCodec
    // Notice that this signature is opaque. This allows pion/webrtc to stay Pure Go.
    // The idea is to not require the main pion/webrtc package to know the input format from the track;
    // it only needs to care how to handle the encoded version. This way, we let the users decide
    // whatever format they wish, which leads to a flexible design. But, since it is opaque,
    // it'll be more error-prone and feel more "magical".
    BuildEncoder(Track) (LocalTrack, error)
}
type SettingEngine struct {
    // internal stuff
}

func (engine *SettingEngine) WithEncoders(encoders ...EncoderBuilder) {}
func (pc *PeerConnection) AddTrack(track Track) {
    // step 1: find common supported codec builders from SettingEngine
    //   note 1.1: if there are multiple matching codecs, try to build them in
    //   sequential order; if one fails, use the next. This is useful when we have
    //   2 or more codec implementations. We allow users to prioritize some
    //   encoders, e.g. hardware accelerated codecs (which commonly fail since the
    //   device might not have hardware support).
    // step 2: create a local track using the encoder builder
    // step 3: create a new RTPSender
    // step 4: replace the RTPSender's local track from step 3 with the local track from step 2
}
This design is actually similar to what Chromium does: https://chromium.googlesource.com/external/webrtc/+/refs/heads/master/media/engine/webrtc_media_engine.h. They have a MediaEngine with an API to set encoder builders; later, PeerConnection can build encoders on the fly.
Note: I've created a couple of POCs in mediadevices:
- Non-WebRTC: https://github.com/pion/mediadevices/blob/redesign/examples/simple/main.go
- Broadcast your camera stream through MJPEG server
- WebRTC: https://github.com/pion/mediadevices/blob/redesign/examples/webrtc/main.go
- Classic 1:1 WebRTC example using jsfiddle
Maybe consider how this ties into a broader (Go) media pipeline? Over time you could build out building blocks like enabling Picture-in-Picture, etc. -- Backkem
- How do we accomplish SVC?
- How do we accomplish Simulcast?
My view is that pion/webrtc should provide a fully compatible WebRTC API with additional enhancements to cover all the use cases. To do this we should introduce MediaTrack as a container of raw data and make the RTPReceiver/RTPSender read/write to MediaTrack and decode/encode RTP streams.
The main issue is that the standard WebRTC API won't let us satisfy all the possible kinds of use cases:
- Create a full go webrtc client
- Create a simple MCU
- Create an SFU
- Basically all the examples in pion/webrtc
That's because in the above use cases a pion/webrtc user needs to read/write from the rtp/rtcp streams, manipulating the packets.
- An SFU needs to read the rtp packets from receiver streams, change their header sequence number/ssrc/timestamp and also some vp8/vp9/av1 packet headers (i.e. the vp9 pictureID and TL0PICIDX), and choose which simulcast streams or which vp8/vp9/av1 (svc) layers to send.
- It also needs to read rtcp packets to implement its own nack handling (requires a custom buffer of the last sent packets), jitter buffer, congestion control algorithms etc...
- The swap-tracks example has the same needs as the sfu.
- The twitch stream example has the same needs, we don't want to re-encode the stream already decoded by the receiver (this will use much more cpu) but use the incoming rtp stream.
(as a consideration, the current pion/webrtc v2 API permits what the standard WebRTC API cannot do, but doesn't permit what the WebRTC API can do...)
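The per-subscriber rewriting an SFU does before forwarding, described above, can be sketched minimally: assign the SFU's own SSRC and keep sequence numbers/timestamps contiguous. The field names mirror the RTP header, but the types here are illustrative stand-ins, not pion/rtp:

```go
package main

import "fmt"

// rtpHeader carries only the fields the rewrite touches.
type rtpHeader struct {
	SSRC           uint32
	SequenceNumber uint16
	Timestamp      uint32
}

// forwarder holds the per-subscriber rewrite state.
type forwarder struct {
	outSSRC   uint32
	seqOffset uint16
	tsOffset  uint32
}

// rewrite rebrands the packet for one subscriber. Unsigned arithmetic
// wraps, which is exactly what RTP sequence numbers and timestamps want.
func (f *forwarder) rewrite(h rtpHeader) rtpHeader {
	h.SSRC = f.outSSRC
	h.SequenceNumber += f.seqOffset
	h.Timestamp += f.tsOffset
	return h
}

func main() {
	f := &forwarder{outSSRC: 0xDEADBEEF, seqOffset: 100, tsOffset: 3000}
	in := rtpHeader{SSRC: 42, SequenceNumber: 65500, Timestamp: 90000}
	out := f.rewrite(in)
	fmt.Println(out.SSRC == 0xDEADBEEF, out.SequenceNumber, out.Timestamp) // true 64 93000
}
```

Note how the sequence number wraps around 65536, exactly as RTP requires.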
- Make two kinds of rtpreceiver, rtpsender and track: a webrtc spec mode and a raw mode receiver/sender/track.
- The Receiver will fully terminate the rtp/rtcp streams. It'll contain a decoder and a controller to handle incoming rtcp packets (sdes etc...) and send rtcp packets (nack, pli, fir, remb etc...). A receiver could receive more than one rtp/rtcp stream when using simulcast or repair streams. SVC instead is just one rtp stream, and the decoder should choose which temporal/spatial/quality layers to decode (all or only some of them, also based on cpu constraints).
- The Sender will fully handle (create) the rtp/rtcp streams. It'll contain an encoder and a controller to handle incoming rtcp packets (nack, pli, fir, remb etc...). A sender could send more than one rtp stream when using simulcast or repair streams. SVC instead is just one rtp stream, and the encoder should choose which temporal/spatial/quality layers to encode (all or some of them, also based on cpu constraints).
The rtpreceiver/rtpsender decoders/encoders could also provide feedback to an external controller to handle global decisions (like congestion control, global bandwidth estimation) and can be externally tuned based on these/other decisions (vp9/av1 svc layer setup, changing the current encoding bandwidth, changing the current decoding simulcast stream or svc layers etc...). For example, we have limited upload bandwidth and we want to split it between N senders.
The MediaTrack api will be the one defined in the webrtc api, with additional methods to Read/Write raw data (frames, audio etc...).
- The rtpreceiver/rtpsender will work like current pion/webrtc v2. They will directly provide their rtp/rtcp streams that can be read/written by the user (needed for an sfu etc...)
RawTrack will have properties containing the ID/Label, and every track rtp stream will also have properties like ssrc, rid. These will be populated by the receiver when receiving, or will be used by the sender in raw mode to choose how to negotiate (standard, simulcast, use repair streams). It'll have a list of RTPStreams (see simulcast PR #1200).
For reading/writing, instead of putting methods on RawTrack/RawRTPStream, the user could just use the receiver/sender ReadRTP/ReadRTCP/WriteRTP/WriteRTCP methods.
In raw mode we should choose whether the sender will manipulate some rtp packets or not (who will set the right ssrc or the mid and streamID header extensions in outgoing packets? The user or the Sender?).
The webrtc standard has APIs to define the sender behavior but not the receiver behavior: the AddTransceiver init options. I.e. for simulcast/svc, the init.sendEncodings options are the standard way to define how the sender should encode.
These are not enough for our needs since we also have to choose the receiver behavior. For example during negotiation a receiver could be automatically created and we would like to hook into it to set its mode.
I'll add an option to the SettingsEngine to define a per-PeerConnection default behavior for senders/receivers.
If we want fine-grained behavior selection we should instead manually set up Transceivers, adding the preferred Track type: if the "track" is a MediaTrack then they'll work in webrtc default mode; if it's a RawTrack they'll operate in raw mode.
An SFU will use "raw mode".
When creating a transceiver, pass a RawTrack to AddTrack/AddTransceiverFromTrack. The receiver/sender will work in "raw mode". The RawTrack properties will be used for negotiation (standard or simulcast, use repair streams).
When a receiver has negotiated, the OnTrack method will provide a RawTrack (this requires type casting...) and the receiver.
When creating a transceiver, pass a MediaTrack to AddTrack/AddTransceiverFromTrack. The receiver/sender will work in "webrtc spec mode".
When a receiver has been negotiated, the OnTrack method will provide a MediaTrack (this requires type casting...).
To set up the sender behavior, use the standard AddTransceiver init options for simulcast/svc, i.e. the init.sendEncodings options (or add a custom sender method?).
- 1. Review and rewrite the libraries pion/webrtc depends on: rtp, rtcp, etc.
- 2. Rewrite pion/webrtc to v3
- We should make cgo pluggable or configurable if we have to use it
- This can make an SFU or other apps stable and high-performance