How to: Handle Errors

Error tables and strategies


Introduction

This guide covers error handling when interacting with the C2C API. Gracefully handling HTTP errors when something goes wrong is an essential part of integrating with any third-party service, so let the grace begin!

Types of errors

There is an infinite set of things that can go wrong in any application, but let’s divvy them up into four basic buckets:

  • I/O errors: Originate from the hardware on your device, and can happen from a bad read or write.
  • Application errors: Originate from your application code.
  • Network errors: These errors originate in the network stack, and are going to be communicated by your networking library.
  • API errors: Originate in Frame.io’s backend.

Each of these error groups has slightly different considerations when implementing your integration. This guide will mostly focus on the last group of errors — API errors — but will touch on the others when discussing general error handling strategies.

How API errors are returned

There are two main ways errors are returned from the C2C API:

  • Status codes: Many errors are indicated by an HTTP error code.
  • Error messages: Some errors share an error code, and need to be disambiguated by decoding the returned payload.

Error status codes

HTTP status codes are a standard set of numbers used to communicate the overall result of an HTTP request. You can read more about HTTP code meanings in this great resource from Mozilla, or if you prefer feline-based learning frameworks, this page was made for you.

Each Frame.io API endpoint will have an expected status code on success, generally 200: OK, 201: Created, or 204: No Content. You can either explicitly check for the correct status, or more generally check that the status code is between 200 and 299.

If a status code is greater than 399, then it is an error status code. Most API errors will be in the 4XX range (400-499). In general, if an error is not a 4XX error, that error did not originate on our end, but somewhere between your device and our application. One notable exception is 500 - which is an INTERNAL SERVER ERROR. 500 errors mean that something unexpected went wrong in the inner workings of our server code.

Similarly, 404: Not Found may be produced by services / hardware between your device and our backend, even though it is a 4XX status.
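As a rough sketch of the status ranges described above, a small helper can bucket a status code before any payload parsing happens. The function name classify_status and its three labels are illustrative assumptions, not part of the C2C API:

```python
def classify_status(status_code: int) -> str:
    """Bucket an HTTP status code using the ranges described in this guide."""
    # 200-299 covers 200: OK, 201: Created, 204: No Content, and so on.
    if 200 <= status_code <= 299:
        return "success"

    # Anything greater than 399 is an error status code.
    if status_code > 399:
        return "error"

    # 1XX and 3XX codes are neither a success nor an error here.
    return "unexpected"
```

A stricter integration would still compare against the exact status each endpoint is expected to return.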

If you get an unexpected HTTP status, please let us know!

Error payload schemas

Frame.io has two possible payloads when returning error details: simple and detailed. Different errors return different payloads, so it’s important that your error parsing logic be able to handle both.

Simple error schema

Let’s make an incorrect request with a bad client_secret:

Shell
curl -X POST https://api.frame.io/v2/auth/device/code \
    --include \
    --form 'client_id=Some-Client-ID' \
    --form 'client_secret=bad_secret' \
    --form 'scope=asset_create offline'

Response:

HTTP/2 400
...

{"error":"invalid_client"}

The simple payload is, well, simple. It only has a single field that can be used to look up the error.

Detailed error schema

Let’s make another curl request that does not include authorization to get an idea of how our detailed API errors are structured:

Shell
curl -X POST https://api.frame.io/v2/devices/heartbeat \
    --header 'Authorization: Bearer bad-token' \
    | python -m json.tool

Response:

JSON
{
    "code": 409,
    "errors": [
        {
            "code": 409,
            "detail": "The channel you're uploading from is currently paused.",
            "status": 409,
            "title": "Channel Paused"
        }
    ],
    "message": "Channel Paused"
}

These kinds of errors are defined by their message field. By extracting the message you can determine the error type.

Determining the error type

In general, when parsing Frame.io errors to native error types, you should first check for an error payload as shown above and — if one was not returned — fall back to using the HTTP status code to determine error type.

A basic error handler might look like this:

Python
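# The error classes referenced below (SlowDownError, etc.) are your own native
# error types, defined elsewhere in your integration.

# Dict of known error codes: native errors.
ERROR_STATUS_MAP = {
    429: SlowDownError,
    # ...
}

# Dict of known error messages: native errors.
ERROR_MESSAGE_MAP = {
    "Channel Paused": ChannelPausedError,
    "invalid_client": InvalidClientError,
    "slow_down": SlowDownError,
    # ...
}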

def _c2c_extract_error_message(response):
    """
    Gets the error message from an error payload. Returns `None`
    if an error payload is not found.
    """

    # Try to decode the payload. If it is not JSON, return `None`. JSON decode
    # errors subclass `ValueError` in both `requests` and the standard `json`
    # module, so no library-specific import is needed.
    try:
        payload = response.json()
    except ValueError:
        return None

    # Both error schemas are JSON objects; anything else is not an error payload.
    if not isinstance(payload, dict):
        return None

    # Try the simple error schema first.
    message = payload.get("error")
    if message is not None:
        return message

    # Now try the detailed schema. Returns `None` if we do not find one.
    return payload.get("message")

def _c2c_error_type_from_response(response):
    """
    Converts a bad HTTP response into an error.
    """
    error_message = _c2c_extract_error_message(response)

    # Try to do a lookup of the error type by message.
    error_type = ERROR_MESSAGE_MAP.get(error_message)
    if error_type is not None:
        return error_type()

    # If not, try to do a lookup by error code.
    error_type = ERROR_STATUS_MAP.get(response.status_code)
    if error_type is not None:
        return error_type()

    # Otherwise we are going to return an `UnknownAPIError` to signal that we
    # encountered an error from Frame.io's backend servers, but do not know the
    # message and/or status code.
    return UnknownAPIError(message=error_message)

def raise_on_frameio_error(response, expected_status):
    """
    Raises a native error from an HTTP response if the response indicates an error
    occurred. Expected status should be the status we expect to get (200, 201, 204,
    etc).
    """

    # If the status code is less than `400`, then it is not an error status code.
    if response.status_code < 400:

        # Check that the status code is the one we expected, otherwise raise an
        # error.
        if response.status_code != expected_status:
            raise UnexpectedStatusError(
                expected=expected_status, received=response.status_code
            )

        return None

    # Otherwise convert and raise a native error.
    raise _c2c_error_type_from_response(response)

See the table at the end of this article for building your error code and message dictionaries / maps / switches.

AWS errors

When uploading file chunks, you will be interacting directly with AWS, which returns its own errors. Check the page on common AWS Errors. Your integration should be able to retry on non-fatal AWS error codes. See the table below for details. In general, errors returned from AWS should be retried at least once.

AWS errors are formatted as XML in the response payload, and will look like this:

Xml
<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>NoSuchKey</Code>
  <Message>The resource you requested does not exist</Message>
  <Resource>/mybucket/myfoto.jpg</Resource> 
  <RequestId>4442587FB7D0A2F9</RequestId>
</Error>

The Code tag is what determines the error type.
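One possible way to pull the Code value out of such a payload is with Python's standard xml.etree.ElementTree module. This is only a sketch: the helper name aws_error_code is hypothetical, and it assumes the raw response body is available as bytes.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def aws_error_code(body: bytes) -> Optional[str]:
    """Return the text of the <Code> tag, or None if this is not an AWS error."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # The body was not XML at all.
        return None

    # AWS error payloads use <Error> as the root element.
    if root.tag != "Error":
        return None

    return root.findtext("Code")


SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<Error><Code>NoSuchKey</Code>"
    b"<Message>The resource you requested does not exist</Message></Error>"
)
code = aws_error_code(SAMPLE)
```

The returned string can then be used as the lookup key for your retry logic, in the same way as the message field of a Frame.io error.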

Retrying Errors

When should I retry an error?

The tables below mark API errors that you should retry on, but you will need to determine which errors from I/O, your networking library, and AWS are good candidates for a retry. A good rule of thumb is to retry errors that may result from a transient state. Many networking errors fall into this category: the network was temporarily congested, Frame.io’s backend was down, too many packets got lost, and so on. Almost every networking library raises something like a TimeoutError when a request goes too long without a response, which is a good example of an error that should be retried.

Some API errors also fall into this bucket. In the tables at the bottom of this document, we have marked when errors should be retried or when they should be considered fatal.

When in doubt, retry once

It’s fairly settled science that computers are weird. It’s worth retrying a failed request even when you get an error that seems like it should be fatal. Maybe your computer’s CPU was overheated, got hit by a cosmic ray, or triggered a memory bug due to a rare set of conditions. When you encounter a “fatal” error that isn’t explicitly clear, retry it one time before giving up, even if it’s something that seems fatal like a DivideByZero error. Maybe your computer was just in a really weird state.

There are some exceptions to the rule. If you receive a 403: ACCESS FORBIDDEN code from the server when attempting to create an asset, that means that devices have been paused and one should not attempt to re-upload. This error is exceedingly unlikely to be triggered by a bad state, and should not be retried.

Exponential backoff

There is a limit to the number of requests that Frame.io will let you make, and when you exceed it, you will receive EITHER a 429: Slow Down error, or a 400 and a payload like so:

HTTP/2 400

{"error":"slow_down"}

When you receive such an error, you should begin exponential backoff on retries. In general a good formula for determining delay (in seconds) between requests is:

Python
delay = min(2 ** attempt / 2, 32.0)

Where attempt is 0-indexed. This algorithm will produce delays like so: 0.5s, 1s, 2s, 4s, 8s, 16s, 32s, with every request after this waiting 32 seconds between calls.
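To check the sequence, the formula can be wrapped in a small function and evaluated for the first few attempts. The function name backoff_delay and the cap parameter are illustrative choices for this sketch:

```python
def backoff_delay(attempt: int, cap: float = 32.0) -> float:
    """Delay in seconds before the next retry, where attempt is 0-indexed."""
    return min(2 ** attempt / 2, cap)


# Delays for the first eight retries.
delays = [backoff_delay(attempt) for attempt in range(8)]
# delays == [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 32.0]
```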

Backoff jitter

Though not required, we appreciate it when our integrators add jitter to their backoff logic. This stops retry attempts from being “synced” between devices that all experience errors at the same time due to a backend failure. When devices all experience a simultaneous error, it can cause a regular spike in requests that can bring our server to its knees. This is part of the thundering herd problem, and can be mitigated by adding a random offset to your request backoffs to keep your device from issuing retries in sync with every other connected device. We recommend calculating jitter as a random value between 0 and 1/2 the delay value, for example random.uniform(0, delay / 2) in Python.
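A minimal sketch of that recommendation, using the standard library's random.uniform. The helper name jittered_delay is an assumption made for this example:

```python
import random


def jittered_delay(delay: float) -> float:
    """Add a random offset between 0 and half of the base delay."""
    return delay + random.uniform(0, delay / 2)


# An 8 second base delay becomes something between 8 and 12 seconds.
example = jittered_delay(8.0)
```

Because each device draws its own random offset, retries that started in lockstep drift apart after an attempt or two.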

Although you only need to implement exponential backoff for SlowDown errors, it's not a bad idea to use exponential backoff in general when dealing with errors resulting from a network call or I/O failure. Exponential backoff allows for a transient bad situation — like a congested network or over-stressed storage — to clear up without adding to the resource load while exhausting your retry limit.

Detecting disconnected status

When we get a networking error — like a TimeoutError — it's possible that Frame.io has become unreachable because:

  • Your network is down
  • Frame.io is down
  • Some other device between you and Frame.io is experiencing errors

It’s important that you catch these errors. When an error occurs that might be the result of your device being unable to connect to Frame.io, you should launch a task that waits for your device to re-establish communication, and indicate to the user that the connection to Frame.io has been lost.

Waiting for connection and authorization

It’s good to build some logic into your app to wait to make a request if your device is in the middle of refreshing its authorization, waiting for the user to authorize it, or cannot currently reach Frame.io. This helps reduce the number of useless network requests being issued if we know we are in a state that cannot complete them.

Unless we are making a call to https://api.frame.io/health, you should block calls from being made if you think you are in a disconnected state. When an error is raised that might be caused by Frame.io being unreachable, a task should be launched that polls https://api.frame.io/health, and blocks all further calls to Frame.io until a successful response is received.

Likewise, if you get an error that indicates your token has expired, you should block all further calls that require authorization until a new access_token has been issued. If refreshing the token also fails, then the user will need to be alerted to sign in while your requests block.

When polling for connection status, you should use the same exponential backoff strategy as discussed above.
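The polling loop might be sketched as follows. Here check_health is a hypothetical callable supplied by your integration: it is assumed to issue a request to https://api.frame.io/health and return True on a successful response and False on any networking error. The sleep parameter is injectable only to make the sketch easy to test.

```python
import time
from typing import Callable


def wait_for_connected(
    check_health: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until check_health() succeeds. Returns the number of polls made."""
    attempt = 0
    while not check_health():
        # Same exponential backoff as for retries: 0.5s, 1s, 2s, ... capped at 32s.
        # Clamping the exponent keeps the power small during very long outages.
        sleep(min(2 ** min(attempt, 6) / 2, 32.0))
        attempt += 1
    return attempt + 1
```

Other calls to Frame.io would wait on this function (or on an event it sets) before being issued, and the user would be shown a disconnected indicator while it runs.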

Request timeouts

Many networking libraries are pre-configured with either an infinite or very long (10+ minute) timeout when making HTTP requests. We recommend using the following timeout values:

  • Default: 15 seconds. If a basic request takes more than 15 seconds, it’s better to just retry it.
  • Authorization Refresh: 2 minutes. A simple request, but if the request has hit our server before stalling, it's possible your authorization has been revoked. Giving a little extra time will make it less likely that the user will have to redo the auth process.
  • File Chunk Upload: 5 minutes. It can take a surprisingly long time to upload ~20MB of data on a slow network.
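These values could be kept in one place so every call site picks the right one. The dictionary keys and the timeout_for helper below are illustrative names, not part of any Frame.io SDK:

```python
# Recommended timeouts from the list above, in seconds.
REQUEST_TIMEOUTS = {
    "default": 15.0,
    "auth_refresh": 120.0,
    "chunk_upload": 300.0,
}


def timeout_for(request_kind: str) -> float:
    """Look up the timeout for a kind of request, falling back to the default."""
    return REQUEST_TIMEOUTS.get(request_kind, REQUEST_TIMEOUTS["default"])
```

With a library such as requests, the result would typically be passed as the timeout argument of the call, for example requests.put(url, data=chunk, timeout=timeout_for("chunk_upload")).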

Example retry handler

Let’s look at some python-like pseudocode for handling errors and exponential backoff. When C2C.method is used here, we are calling a theoretical client for interacting with the C2C API that handles things like tracking authorization, polling for health, etc.

Python
import random
import time

# Errors we know are fatal and should not be retried.
FATAL_ERRORS = (
    ChannelPausedError,
    DevicesDisabledError,
    # ...
)

# Errors we know should be retried more than once.
RETRY_ERRORS = (
    TimeoutError,
    NotFoundError,
    SlowDownError,
    UnknownAPIError,
    # ...
)

# Errors that could be the result of Frame.io being unreachable.
DISCONNECTED_ERRORS = (
    TimeoutError,
    HttpClientError,
    # ...
)

def retry_with_backoff(next_handler):
    """
    Middleware for retrying errors with exponential backoff.
    """

    def retry_handler(call, retry_count):
        """
        Handler for retrying c2c API calls with exponential backoff.
        """

        error = None

        # With a retry_count of 8, the delays between attempts total
        # 63.5 seconds +- ~32 seconds of jitter.
        for attempt in range(retry_count):

            # If we are attempting to reach an endpoint that requires authorization
            # we should wait until we have valid authorization before attempting
            # a call. We need to do this each time in case our access_token
            # expires between attempts.
            C2C.wait_for_authorized(call)

            # Likewise, we should wait until we are connected to Frame.io to attempt
            # a call if we are not calling `https://api.frame.io/health`
            C2C.wait_for_connected(call)

            try:
                # Return the result on a success.
                return next_handler(call)
            except FATAL_ERRORS:
                # If we hit an error we know is fatal, re-raise it without
                # retrying.
                raise

            except RETRY_ERRORS as exc:
                # If we hit an error we know we should retry many times, continue,
                # but notify our client if we think we may have been disconnected.
                # Python unbinds `exc` after the block, so keep our own reference.
                error = exc
                if isinstance(exc, DISCONNECTED_ERRORS):
                    C2C.notify_disconnected()

            except Exception as exc:
                # Otherwise, do not retry the call more than once.
                error = exc
                if attempt > 0:
                    raise

            # There is no point in sleeping after the final attempt.
            if attempt == retry_count - 1:
                break

            # The delay for the next attempt should be no more than 32 seconds.
            # This algorithm will go: 0.5s, 1s, 2s, 4s, 8s, 16s, 32s, 32s, ...
            delay = min(2 ** attempt / 2, 32.0)

            # Add some randomness (jitter) to the delay (up to half the value of
            # the delay in either direction).
            delay += random.uniform(-delay, delay) / 2

            # Wait between retries.
            time.sleep(delay)

        # If we have exhausted all retries, raise the last error we saw.
        raise error

    return retry_handler

Error tables

Our API has a number of errors it can return. When handling errors, we can break down errors by their message or status codes as shown in the table below. Here is a quick description of the columns for these tables:

message: The message in the error payload that defines this error.

http code: The HTTP status code that defines the error.

error type: A theoretical error type that the message or status maps to. Error types and their meaning are discussed in-depth in the descriptions section. You may notice in this table that some status codes or messages map to the same Error Type. Different parts of our API will sometimes communicate the same idea (like slow down / back off errors) slightly differently, but map to the same error class.

schema: The payload schema the error message uses. Either simple or detailed.

retry: Whether the error should be retried, may be one of the following:

  • yes: Retry the error until you hit your retry limit.
  • once: Retry the error once, just in case something went wrong with the networking stack or request encoding.
  • no: This is a fatal error, do not retry at all.

Values with an asterisk (*) next to them have some caveats or details that bear further inspection. Look up the Error Type in the descriptions section to get more information on the value.

Error messages

Here is a table of error messages that should be handled by your application.

| Message | Error Type | HTTP Code | Schema | Retry |
| --- | --- | --- | --- | --- |
| "access_denied" | AccessDenied | 401 | simple | once |
| "authorization_pending" | AuthorizationPending | 400 | simple | yes |
| "Channel Paused" | ChannelPaused | 409 | simple | no |
| "expired_token" | ExpiredToken | 400 | simple | no |
| "Invalid Argument" | InvalidArgument | 422 | detailed | no |
| "invalid_client" | InvalidClient | 400 | simple | no |
| "invalid_grant" | InvalidGrant | 400 | simple | no |
| "invalid_request" | InvalidRequest | 400 | simple | once |
| "Not Authorized" | UnauthorizedClient | 401 | detailed | no |
| "slow_down" | SlowDown | 400 | simple | yes |
| "unauthorized_client" | UnauthorizedClient | 401 | simple | yes* |

Status codes

| HTTP Code | Error Type | Retry |
| --- | --- | --- |
| 400 | InvalidRequest | once |
| 401 | UnauthorizedClient | no |
| 422 | InvalidContentType | no |
| 429 | SlowDown | yes |
| 500 | InternalServerError | yes |
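The two tables above could be turned into lookup dictionaries along these lines. This is a sketch rather than a reference implementation: the names RETRY_BY_MESSAGE, RETRY_BY_STATUS, and retry_policy are made up for the example, the asterisk caveat on unauthorized_client is not modelled, and defaulting unknown errors to "once" is an assumption based on the "when in doubt, retry once" advice earlier in this guide.

```python
from typing import Optional

# Retry policy keyed by the message found in the error payload.
RETRY_BY_MESSAGE = {
    "access_denied": "once",
    "authorization_pending": "yes",
    "Channel Paused": "no",
    "expired_token": "no",
    "Invalid Argument": "no",
    "invalid_client": "no",
    "invalid_grant": "no",
    "invalid_request": "once",
    "Not Authorized": "no",
    "slow_down": "yes",
    "unauthorized_client": "yes",
}

# Fallback retry policy keyed by HTTP status code.
RETRY_BY_STATUS = {400: "once", 401: "no", 422: "no", 429: "yes", 500: "yes"}


def retry_policy(message: Optional[str], status_code: int) -> str:
    """Prefer the payload message; fall back to the status code, then to 'once'."""
    if message in RETRY_BY_MESSAGE:
        return RETRY_BY_MESSAGE[message]
    return RETRY_BY_STATUS.get(status_code, "once")
```

Checking the message first matters because several distinct errors share the 400 status code.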

AWS Errors

See here for AWS error descriptions.

| Error | Retry |
| --- | --- |
| InternalError | yes |
| OperationAborted | yes |
| RequestTimeout | yes |
| ServiceUnavailable | yes |
| SlowDown | yes |
| [All Other Errors] | once |

Parsing similar AWS errors

Both SlowDown and ServiceUnavailable errors indicate you are sending requests too quickly. They can be cast to the same class / error code as SlowDown in our other tables, and should trigger exponential backoff just like native Frame.io SlowDown errors. Likewise, InternalError can be parsed to the same class as InternalServerError in the above table.
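Expressed as data, the AWS table and the casting rules above might look like this. The names AWS_RETRYABLE, AWS_TO_NATIVE, aws_retry_policy, and native_error_name are hypothetical and chosen only for this sketch:

```python
# AWS error codes that should be retried until the retry limit is hit.
AWS_RETRYABLE = {
    "InternalError",
    "OperationAborted",
    "RequestTimeout",
    "ServiceUnavailable",
    "SlowDown",
}

# AWS error codes that can share a native Frame.io error type.
AWS_TO_NATIVE = {
    "SlowDown": "SlowDown",
    "ServiceUnavailable": "SlowDown",
    "InternalError": "InternalServerError",
}


def aws_retry_policy(code: str) -> str:
    """Return 'yes' for known retryable codes and 'once' for everything else."""
    return "yes" if code in AWS_RETRYABLE else "once"


def native_error_name(code: str) -> str:
    """Map an AWS code to its native equivalent, or keep the AWS code as-is."""
    return AWS_TO_NATIVE.get(code, code)
```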

Descriptions

AccessDenied

Returned when authorization is declined during device pairing.

AuthorizationPending

A user has not yet put a device pairing code into Frame.io. Poll again after the interval you received with the device code.

ChannelPaused

The channel was paused during the time the asset was created. Do not attempt to upload this asset again.

ExpiredToken

The device and user code issued to the user for pairing a hardware device has expired and a new one should be issued.

InternalServerError

Something unexpected went wrong in our backend! Try once more in case it resolves itself. Please report 500’s to us so we can take a look and see what is going wrong.

There are a couple of known bugs where an InternalServerError is returned instead of an InvalidRequest. If you get a consistent 500, check to make sure it is not one of our known issues:

  • Trying to upload to a device channel that does not exist.
  • Requesting a custom chunk count for an asset that does not conform to S3's restrictions.

InvalidArgument

An argument in your payload did not contain a valid value. Double-check that you are supplying an expected value with the argument indicated in the error detail. This error is similar to BadRequest.

InvalidContentType

The Content-Type header of your request is not supported. Please supply another content type (or make sure yours is spelled correctly). What content types are accepted is dependent on endpoint. In general, the content types Frame.io supports are:

  • multipart/form-data (authorization requests only)
  • application/x-www-form-urlencoded (all endpoints)
  • application/json (all non-authorization endpoints)

InvalidClient

The client_id, client_secret, or other client-related value is unrecognized. Make sure your values match what we have on-file in our backend!

InvalidGrant

The authorization grant type is not valid. Please review the authorization guides for grant type values you should be supplying during authorization.

InvalidRequest

The request payload / parameters are malformed. Make sure your fields are all named correctly and values are formatted as expected.

If you get this error when refreshing authorization, your refresh token has expired and you will need to restart the auth process.

SlowDown

You are making requests to this resource too quickly. Start executing exponential backoff. Note that you can trigger slow down errors by making multiple requests for a device code on the same TCP connection. For now, requests for device pairing codes should always be made on their own TCP connection, and then the connection should be closed.

UnauthorizedClient

In most cases this error is returned when a user’s access_token has expired or was not provided. If you get this error, you should refresh your access token, then continue to retry the request. If you get this error while refreshing your access token, you will need to restart the authorization process before retrying the request.

This error can also occur when you are trying to access a resource you are not authorized for, like listing comments. When a project disables C2C devices, it will trigger the error as well. Make sure you are not trying to access a resource that is not part of the C2C API. If you keep getting this error on a C2C endpoint, it’s possible you did not request the correct scopes during authorization or project devices have been disabled on the project.

If this error is returned when attempting to refresh an access token, the authorization process must be restarted. Alert the user that the device / app must be reconnected to a project.
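The refresh-then-retry flow described in this section could be sketched like so. Everything here is an assumption for illustration: UnauthorizedClientError stands in for your native error type, and make_request, refresh_access_token, and restart_authorization are callables your integration would supply (the last one is assumed to block until the user has reconnected the device).

```python
from typing import Any, Callable


class UnauthorizedClientError(Exception):
    """Stand-in for the native error mapped from UnauthorizedClient."""


def call_with_refresh(
    make_request: Callable[[], Any],
    refresh_access_token: Callable[[], None],
    restart_authorization: Callable[[], None],
) -> Any:
    """Run a request, refreshing or restarting authorization if it is rejected."""
    try:
        return make_request()
    except UnauthorizedClientError:
        # Fall through and try to refresh the token.
        pass

    try:
        refresh_access_token()
    except UnauthorizedClientError:
        # The refresh itself was rejected, so the user must authorize again.
        restart_authorization()

    # Retry the original request once with the new authorization.
    return make_request()
```

If the second attempt also raises UnauthorizedClientError, the error propagates to the caller, which fits the case where the resource is simply not available to the device.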

Next up

If you haven’t already, we encourage you to reach out to our team, then continue to the next guide. We look forward to hearing from you!