Diagnosing a VISCA camera issue

As I have mentioned before, I’ve been spending a lot of time over the last two years writing code for my local church’s A/V system. (Indeed, I’ve been giving quite a few user group talks recently about the fun I’ve had doing so.) That new A/V system is called “At Your Service”, or AYS for short. I only mention that because it’s simpler to refer to AYS for the rest of this post than “the A/V app”.

Our church now uses three cameras from PTZOptics. They’re all effectively the same model: 30x optical Zoom, Power over Ethernet, with NDI support (which I use for the actual video capture – more about that in a subsequent blog post). I control the cameras using VISCA over TCP/IP, as I’ve blogged about before. At home I have one PTZOptics camera (again, basically the same camera) and a Minrray UV5120A – also 30x optical zoom with NDI support.

Very occasionally, the cameras run into problems: sometimes the VISCA interface comes up, but there’s no NDI stream; sometimes the camera just doesn’t stop panning after I move it. In the early days, when I would only send a VISCA command when I wanted to actually do something, I believe the TCP/IP connection was closed automatically after being idle for an hour.

For these reasons, I now have two connections per camera:

  • One for the main control, which includes a “heartbeat” command of “get current pan/tilt/zoom” issued every three minutes.
  • One for frequent status checking, so that we can tell if the camera itself is online even if the main control connection ends up causing problems. This sends a “get power status” every 5 seconds.

If anything goes wrong, the user can remotely reboot the camera (which creates a whole new TCP/IP connection just in case the existing ones are broken; obviously this doesn’t help if the camera is completely failing in network terms).

This all works pretty well – but my logs sometimes show an error response from the camera – equivalent to an HTTP 400, basically saying “the VISCA command you sent has incorrect syntax”. Yesterday I decided to dedicate some time to working out what was going on, and that’s the topic of this blog post. The actual problem isn’t terribly important – I’d be surprised if more than a handful of readers (if any!) faced the same issue. But I’m always interested in the diagnostic process.

Hack AYS to add more stress

I started off by modifying AYS to make it send more commands to the camera, in two different ways:

  • I modified the frequency of the status check and heartbeat
  • I made each of those also send 100 commands instead of just one
    • First sending the 100 commands one at a time
    • Then changed to sending them as 100 tasks started at the same time and then awaited as a bunch

This increased the rate of error – but not in any nicely reproducible way.

Looking back, I probably shouldn’t have started here. AYS does a lot of other stuff, including quite a lot of initialization on startup (VLC, OBS, Zoom) which makes the feedback loop a lot slower when experimenting. I needed something that was dedicated to just provoking the problem.

Create a dedicated stress test app

It’s worth noting at this point that I’ve got a library (source here) that I’m using to handle the TCP connections and send the commands. This keeps all the knowledge of exactly how VISCA works away from AYS, and makes it reasonably easy to write test code separately from the app. However, it does mean that when something goes wrong, my first expectation is that this is a problem in the library. I always assume my own code is broken before placing the blame elsewhere… and usually that assumption is well-founded.

Despite the changes to AYS not provoking things as much as I’d expected, I still thought concurrency would be the key. How could we overload the camera?

I wrote a simple stress test app to send lots of simple “get power status” requests at the camera, from multiple tasks, and record the number of successful commands vs ones which failed for various different reasons (camera responded with an error; camera violated VISCA itself; command timed out). Initially, I found that some tasks had issued far more requests than others – this turned out to be due to the way that the thread pool (used by tasks in a console app) starts small and gradually expands. A significant chunk of the stress test app is given over to getting all the tasks “ready to go” and releasing them all at the same time.

The stress test app is configurable in a number of ways:

  • The camera IP address and VISCA port
  • Whether to use a single controller object (which means a single TCP connection) or one per task. (The library queues concurrent requests, so any given TCP connection only has a single active command at a time.)
  • How long each task should delay between requests
  • How long each command should wait before timing out
  • How long the test should last overall

At first, the results seemed all over the place. With a single controller, and enough of a delay (or few enough tasks) to avoid commands timing out due to the queuing, I received no errors. With multiple controllers, I saw all kinds of results – some tasks never managing a single request, others with just a few errors, and lots of timeouts.

Oh, and if I overwhelmed the camera too much, it would just reset itself during the test (which involves it tilting up to the ceiling then back down to horizontal). Not quite “halt and catch fire”, but a very physically visible indication of an error. I’m pretty confident I haven’t actually damaged the camera though, and it seems fairly reasonable for it to reset itself in the face of a sort of DoS attack like this stress test.

However, I did spot that I never had a completely clean test with more than one task. I wondered: how low could I take the load and still see problems?

Even with just two tasks, and a half second delay between attempts, I’d reliably see one task have a single error response, and the other task have a command timeout. Aha – we’re getting somewhere.

On a hunch (so often the case with diagnostics) I added console output in the catch blocks to include the time at which the exception was thrown. As I’d started to suspect, it was the very first command that failed, and it failed for both tasks. Now we’re getting somewhere.

What’s special about the first command? We’ve carefully set the system up to start all the tasks sending commands at exactly the same time (as near as the scheduler will allow, anyway) – they may well get out of sync after the first command, as different responses take slightly different lengths of time to come back, but the first commands should be sent pretty much simultaneously.

At this point, I broke out Wireshark – an excellent network protocol analyzer, allowing me to see individual TCP packets. Aside from anything else, this could help to rule out client-side issues: if the TCP packets I sent looked correct, the code sending them was irrelevant.

Sure enough, I saw:

  • Two connections being established
  • The same 5 bytes (the VISCA command: 81 09 04 00 ff) being sent on each connection in two packets about 20 microseconds apart.
  • A 4 byte response (a VISCA error: 90 60 02 ff) being sent down the first connection 10ms later
  • No data being sent down the second connection

That explained the “error on one, timeout on the other” behaviour I’d observed at the client side.

Test the other camera

Now that I could reliably reproduce the bug, I wondered – what does the Minrray camera do?

I expected it to have the same problem. I’m aware that there are a number of makes of camera that use very similar hardware – PTZOptics, Minrray, SMTAV, Zowietek for example. While they all have slightly different firmware and user interfaces, I expected that the core functionality of the camera would be common code.

I was wrong I have yet to reproduce the issue with the Minrray camera – even throwing 20 tasks at it, with only a millisecond delay between commands, I’ve seen no errors.

Report the bug

So, it looks like this may be a bug in the PTZOptics firmware. At first I was reluctant to go to the trouble of reporting the bug, given that it’s a real edge case. (I suspect relatively few people using these cameras connect more than one TCP stream to them over VISCA.) However, I received encouragement on Twitter, and I’ve had really good support experiences with PTZOptics, so I thought it worth having a go.

While my stress test app is much simpler than AYS, it’s definitely not a minimal example to demonstrate the problem. It’s 100 lines long and requires a separate library. Fortunately, all of the code for recording different kinds of failures, and starting tasks to loop for a while, etc – that’s all unnecessary when trying to just demonstrate the problem. Likewise I don’t need to worry about queuing commands on a single connection if I’m only sending a single command down each; I don’t need any of the abstraction that my library contains.

Instead, I boiled it down to this – 30 lines (of which several are comments or whitespace) that just sends two commands “nearly simultaneously” on separate connections and shows the response returned on the the first connection.

using System.Net.Sockets;

string host = "192.168.1.45";
int port = 5678;

byte[] getPowerCommand = { 0x81, 0x09, 0x04, 0x00, 0xff };

// Create two clients - these will connect immediately
var client1 = new TcpClient(host, port);
client1.NoDelay = true;
var stream1 = client1.GetStream();

var client2 = new TcpClient(host, port);
client2.NoDelay = true;
var stream2 = client2.GetStream();

// Write the "get power" command to both sockets as close to simultaneously as we can
stream1.Write(getPowerCommand);
stream2.Write(getPowerCommand);

// Read the response from the first client
var buffer = new byte[10];
int bytesRead = stream1.Read(buffer);
// Print it out in hex.
// This is an error: 90-60-02-FF
// This is success: 90-50-02-FF
Console.WriteLine(BitConverter.ToString(buffer[..bytesRead]));

// Note: this sample doesn't read from stream2, but basically when the bug strikes,
// there's nothing to read: the camera doesn't respond.

Simple but effectively – it reliably reproduces the error on the PTZOptics camera, and shows no problems on the Minrray.

I included that with the bug report, and have received a response from PTZOptics already, saying they’re looking into it. I don’t expect a new firmware version with a fix in any time soon – but I hope it might be in their next firmware update… and at least now I can test that easily.

Conclusion

This was classic diagnostic work:

  • Go from complex code to simple code
  • Try lots of configurations to try to make sense of random-looking data
  • Follow hunches and get more information
  • Use all the tools at your disposal (Wireshark in this case) to isolate the problem as far as possible
  • Once you understand the problem (even if you can’t fix it), write code designed specifically to reproduce it simply

Hope you found this as interesting as I did! Next (this weekend, hopefully): displaying the video output of these cameras on a Stream Deck…

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s