SDI Capture
- Author: Patrick Hulce (@patrickhulce)
- this post is about SDI capture methods and the most performant way to capture video from cinema-quality cameras, manipulate it with AI in real-time, and send it back out
- introduction
- this year I started a new job at Netflix working on the Studio side in applied research
- it's been a really fun return to my roots
- origin story
- back in high school when I first learned to program, I ran a small video production company too
- I was deep in After Effects and Premiere, filming in front of green screens, and doing a lot of comp work
- shoutout to Andrew Kramer and VideoCopilot.net for the best introduction anyone could ask for
- anyway, did a lot of filming and visual effects work alongside programming, thought that's what I might do forever
- went to college for computer science, got deep into software and kind of forgot about video (except for my personal projects and photography of course)
- fast forward to 2024
- I'm back in the game!
- going from dinky conference productions for small businesses in Tennessee 15 years ago to the literal cutting edge of video production at Netflix has been quite the jump
- super super fun and a literal dream come true
- the problem
- part of my responsibilities now involve on-set processing and manipulation of video in real-time using AI research
- almost everything on-set runs on Windows or dedicated hardware, not exactly friendly to a linux / AI research toolchain
- we have the obvious problem of getting the ML inference pipeline to run in real-time (but I already know a lot about how to optimize that, which is a whole other post), so what about the entire rest of the pipeline: getting video from the studio into python-land and back out again?
- I had no idea how to do this 2 months ago, so there's been lots of learning
- what follows is an explanation of what I've learned so far and the best way to capture and manipulate video from cinema-quality cameras using AI in real-time
- background info
- video production
- SDI is a video interface standard that is used in many broadcast and professional video production environments
- it's basically the USB of video
- it can carry crazy high resolution and frame rates, uncompressed, up to 12 Gbps
- there's usually splitters and repeaters and all kinds of gear converting the signals from raw camera data to apply creative looks and effects
- there's also tons of in-camera options and effects that can affect the look, latency, and quality of the video
- we want to introduce a step that applies AI transformations to the video in real-time
- the goal is to do this as efficiently as possible with minimal latency, ideally with a single machine
- capture cards
- a capture card is a device that sits between the camera and the computer
- it's responsible for converting the raw video signal from the SDI cable into bits for our software to process
- there are many different capture cards, each with their own features and capabilities
- AJA
- the de facto standard for broadcast and professional video production
- very high quality, also very expensive
- not exactly linux friendly
- DeckLink by Blackmagic Design
- another popular choice
- also very high quality
- a Linux driver and source SDK are available, plus an ffmpeg plugin!
- Magewell
- another slightly less popular choice
- best linux support of all, built-in v4l2 support (we'll get to that in a minute)
- as plug-and-play as it gets, USB options that work just like a webcam
- still pretty high quality
- capture methods
- irrespective of the card, you also have many options for capturing the video stream itself from that card
- on linux we have several decent options we'll explore
- v4l2
- what your typical webcam uses
- super easy to use, just plug and play
- v4l2-ctl list example
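- assuming the card enumerates as a normal V4L2 device (the Magewell does out of the box), listing what it exposes looks something like this; `/dev/video0` is a placeholder for whatever device node you actually get

```bash
# list every video device the kernel currently knows about
v4l2-ctl --list-devices

# list the pixel formats, resolutions, and frame rates one device offers
v4l2-ctl -d /dev/video0 --list-formats-ext
```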
- opencv code example
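- here's a minimal sketch of reading frames through OpenCV's V4L2 backend; the device index, resolution, and FPS below are assumptions you'd swap for your own setup

```python
import cv2

# open the capture device (index 0 here; your card may enumerate differently)
cap = cv2.VideoCapture(0, cv2.CAP_V4L2)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)
cap.set(cv2.CAP_PROP_FPS, 60)

while True:
    ok, frame = cap.read()  # `frame` is a numpy BGR array, ready for your model
    if not ok:
        break
    # ... run your AI transformation on `frame` here ...
    cv2.imshow("sdi-feed", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```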
- ffmpeg
- the swiss army knife of video processing
- several options, including reading from a process pipe in python and a companion ffplay command line tool
- bash code example
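- roughly what the command-line side looks like against a V4L2 device (device path, size, and framerate here are assumptions; with an ffmpeg build that includes the DeckLink plugin, the input side becomes `-f decklink` instead)

```bash
# quick visual sanity check of the feed, with buffering minimized
ffplay -f v4l2 -framerate 60 -video_size 1920x1080 -fflags nobuffer /dev/video0

# or decode to raw BGR frames on stdout for another process to consume
ffmpeg -f v4l2 -framerate 60 -video_size 1920x1080 -i /dev/video0 \
  -pix_fmt bgr24 -f rawvideo pipe:1
```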
- python code example reading from stream
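- and a sketch of consuming that raw stream from python; the frame size must match what you asked ffmpeg for, since the pipe is just a firehose of bytes

```python
import subprocess
import numpy as np

WIDTH, HEIGHT = 1920, 1080
FRAME_BYTES = WIDTH * HEIGHT * 3  # bgr24 = 3 bytes per pixel

# launch ffmpeg as a child process and read raw frames off its stdout
proc = subprocess.Popen(
    [
        "ffmpeg",
        "-f", "v4l2", "-framerate", "60", "-video_size", f"{WIDTH}x{HEIGHT}",
        "-i", "/dev/video0",
        "-pix_fmt", "bgr24", "-f", "rawvideo", "pipe:1",
    ],
    stdout=subprocess.PIPE,
)

while True:
    raw = proc.stdout.read(FRAME_BYTES)
    if len(raw) < FRAME_BYTES:
        break  # stream ended
    frame = np.frombuffer(raw, dtype=np.uint8).reshape(HEIGHT, WIDTH, 3)
    # ... run your AI transformation on `frame` here ...
```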
- gstreamer
- an open-source media pipeline framework (the one NVIDIA builds its hardware-accelerated video tooling on top of), supposedly well suited to our use case of processing video in real-time
- several options, including reading frames into opencv from python (via its GStreamer backend) and a companion gst-launch-1.0 command line tool
- bash code example
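- a minimal preview pipeline with the command-line tool (again, device and caps are placeholders for your setup)

```bash
# preview a v4l2 device through a minimal GStreamer pipeline
gst-launch-1.0 v4l2src device=/dev/video0 ! \
  video/x-raw,width=1920,height=1080,framerate=60/1 ! \
  videoconvert ! autovideosink sync=false
```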
- python code example
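- and the same idea from python, feeding a GStreamer pipeline string into OpenCV; this assumes your cv2 build was compiled with GStreamer support (check `cv2.getBuildInformation()`)

```python
import cv2

# appsink keeps only the newest frame so we always process the freshest one
pipeline = (
    "v4l2src device=/dev/video0 ! "
    "video/x-raw,width=1920,height=1080,framerate=60/1 ! "
    "videoconvert ! video/x-raw,format=BGR ! "
    "appsink drop=true max-buffers=1"
)
cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # ... run your AI transformation on `frame` here ...
    cv2.imshow("gst-feed", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
```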
- vendor SDKs
- code from each vendor to read the raw video data from the card, usually in C++
- C++ code example of decklink
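- a rough (not compile-tested) sketch of what that looks like with the DeckLink SDK, using the interfaces from Blackmagic's public samples; the display mode and pixel format constants are assumptions for a 1080p60 8-bit YUV signal, and error handling / COM reference counting is omitted

```cpp
// rough sketch of DeckLink SDK capture (not compile-tested, error handling omitted)
#include "DeckLinkAPI.h"

class FrameReceiver : public IDeckLinkInputCallback {
public:
    HRESULT VideoInputFrameArrived(IDeckLinkVideoInputFrame* frame,
                                   IDeckLinkAudioInputPacket*) override {
        if (!frame) return S_OK;
        void* pixels = nullptr;
        frame->GetBytes(&pixels);  // raw bytes in the pixel format enabled below
        // ... hand `pixels` (frame->GetRowBytes() bytes per row) to your processing thread ...
        return S_OK;
    }
    HRESULT VideoInputFormatChanged(BMDVideoInputFormatChangedEvents,
                                    IDeckLinkDisplayMode*,
                                    BMDDetectedVideoInputFormatFlags) override { return S_OK; }
    // minimal IUnknown plumbing, just for the sketch
    HRESULT QueryInterface(REFIID, void**) override { return E_NOINTERFACE; }
    ULONG AddRef() override { return 1; }
    ULONG Release() override { return 1; }
};

int main() {
    IDeckLinkIterator* iterator = CreateDeckLinkIteratorInstance();
    IDeckLink* deckLink = nullptr;
    iterator->Next(&deckLink);  // grab the first card

    IDeckLinkInput* input = nullptr;
    deckLink->QueryInterface(IID_IDeckLinkInput, (void**)&input);

    FrameReceiver receiver;
    input->SetCallback(&receiver);
    // assumes a 1080p60 signal in 8-bit YUV; match these to your camera's actual output
    input->EnableVideoInput(bmdModeHD1080p6000, bmdFormat8BitYUV, bmdVideoInputFlagDefault);
    input->StartStreams();

    // ... run your app loop, then StopStreams() and release the interfaces ...
    return 0;
}
```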
- colorspaces and pixel formats
- these are super complex and probably deserve a post of their own at some point, but for now, here's the important stuff
- a colorspace is the strategy for defining what the possible values of a color mean, there's scene-referred vs. display-referred and a whole lot of options, but for now just know that the choice can affect latency because of how many bits are used to represent each color
- a pixel format is the strategy for encoding a color in a particular colorspace into actual bytes, the one you're probably most familiar with from a generic tech background is RGB, but video (and jpegs, actually) tends to use YUV, and this too can affect latency (quick sketch below)
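- to make the pixel format point concrete, here's a tiny illustration (mine, not from the experiment) of converting a packed 8-bit YUV 4:2:2 frame, the kind of thing a capture card typically hands you, into the BGR layout most models and OpenCV functions expect

```python
import cv2
import numpy as np

# UYVY is a common packed 8-bit YUV 4:2:2 layout: 2 bytes per pixel
height, width = 1080, 1920
uyvy = np.zeros((height, width, 2), dtype=np.uint8)  # stand-in for a captured frame

# one cvtColor call expands it to the 3-bytes-per-pixel BGR layout
bgr = cv2.cvtColor(uyvy, cv2.COLOR_YUV2BGR_UYVY)
print(bgr.shape)  # (1080, 1920, 3)
```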
- the experiment
- setup
- I wanted to test the "glass-to-glass" latency of the entire pipeline
- set up a stopwatch with highest precision I could find, placed it next to my monitor and pointed the camera at it
- when I bring up the real-time feed on the monitor and take a photo of both, the difference between the time on the physical stopwatch and the time shown in the video feed is the latency! BOOM!
- rented two cameras, a RED Komodo and a Blackmagic Micro Studio Camera 4K, plus an ordinary USB webcam as a control
- the RED Komodo is a very high quality camera, but also very expensive and heavy
- the Blackmagic Micro Studio Camera 4K is also a pretty high quality camera, but light and (relatively) cheap
- we want to isolate each component of the pipeline and test them individually as much as possible
- e.g. if RED latency is 200ms but Blackmagic is 100ms, we know the difference isn't the capture card
- constraints:
- I only had about 1 week to build the entire pipeline and make our recommendations before I flew to LA to start working on the project, which meant some compromises
- I didn't want to spend more than 2 days on this hardware optimization experiment
- ML inference is far better optimized on Ubuntu than on Windows, so we'll use Linux for the pipeline for now
- that means AJA is out for now, but we can revisit if we need to
- the entire team knows python but little C++, so the DeckLink SDK is out too for now, but we can revisit if we need to
- diagram of the pipeline
- light from the scene -> camera sensor -> in-camera processing -> SDI out (camera) -> SDI in (capture card) -> capture card hardware processing -> driver/OS processing -> python/SDK processing
- we'll manipulate the following settings (see the sketch after this list for how they're toggled from the client side)
- camera used, check if it's the bottleneck
- capture card used, determine its latency and best option
- capture+display method, determine best option, understand display latency from opencv render
- FPS, check if it's the bottleneck, obviously the higher the fps, the lower the per-frame latency
- resolution, check impact on latency
- in-camera settings (colorspace / lut), check impact on latency
- capture card settings (rgb vs. yuv, mjpg flag), check impact on latency
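- for reference, here's roughly how those capture-side knobs get toggled from python with OpenCV's V4L2 backend; which properties and fourccs are actually honored depends on the specific card and driver, so treat this as a sketch

```python
import cv2

cap = cv2.VideoCapture(0, cv2.CAP_V4L2)

# resolution and FPS
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)
cap.set(cv2.CAP_PROP_FPS, 60)

# pixel format: request compressed MJPG vs. a raw format like YUYV
cap.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*"MJPG"))

# skip OpenCV's automatic conversion to BGR if you want the raw frames
cap.set(cv2.CAP_PROP_CONVERT_RGB, 0)

print(cap.get(cv2.CAP_PROP_FPS), cap.get(cv2.CAP_PROP_FOURCC))
```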
- the test complications
- the "millisecond precision" stopwatch was not nearly as precise as advertised and reviewed
- I doubt anyone was actually taking high-FPS recordings and checking that the latency readings were at millisecond granularity
- the screen appeared to have a ~30Hz refresh rate, so the latency readings were in ~33ms increments 😭
- fix: do a ton of trials, measure frame delta on larger time scales, and average it all out
- disappointing? yes, but also, we only really care about frame-level delay at 24FPS for cinema, so it still works
- the results
- most comparisons of the form "which card/setting is faster, X or Y" are inconclusive (not enough statistical power to draw a conclusion, or too many context changes erase or reverse the "gains")
- most important factors are...
- capture resolution (duh, ~60ms for 1080p, ~200ms for 4K, almost 4x to match the pixels you're pushing!)
- FPS (duh, ignoring hardware the frame interval alone jumps from ~17ms at 60 FPS to ~42ms at 24, a mandatory ~25ms of pure physics)
- client/driver choice (v4l2 faster than ffmpeg faster than gst)
- camera itself
- screenshot of giant spreadsheet with all the results
- widget to play around with latency ranges and see the impact on the pipeline
- hypothetical fastest possible configuration:
- 1080p Blackmagic cam @ 60 FPS, Magewell USB Capture card, v4l2, RGB pixel format, no colorspace / LUT transformations
- capture card made little difference
- DeckLink can be faster, but not conclusively so
- surprising given PCIe vs. USB; part of this is driver related, since v4l2 was by far the best method and DeckLink doesn't support it
- both are pretty well optimized
- pixel format not as big a hardware lift as one might hope
- CV2 is pretty good at converting between pixel formats already, max savings of a few ms
- input resolution (what the camera sends before any scaling) doesn't really matter
- hardware processing and scaling down in the camera vs. in the capture card isn't a big deal
- ever so slightly faster to do it in camera
- big difference was hardware vs. software, obviously
- processing from python was sometimes FASTER than vendor-provided C++ options
- GStreamer, the purpose-built C++ pipeline option, was 2x slower than python v4l2 -> opencv display when using the Magewell card, but the opposite held for DeckLink cards
- never thought that would happen, even if inconsistent across configs!
- conclusion: soooooo much of the performance is dependent on your very specific environment / client choice, very tough to make generalizable statements.
- unfortunately, severely limits the usefulness of this article to you, the reader
- ultimately, just like in code, a lot of these micro-optimizations weren't as big of a deal as you might think
- you should probably measure your specific case and only care if it's noticeable!