Smart Feeds

Copyright (C) 2011 Livefyre, Inc.

Introduction

Social web applications often produce and consume streams of activity data, such as status updates, stories, comments, and likes. For example, take the social commenting service, Livefyre. The feed pattern is so common that it comes up at least five times within that service alone:

  • Participation in a commenting widget (retrieving existing comments, listening for updates)
  • Display of activity on a user’s profile page
  • Users tracking conversations for events relevant to them (e.g. someone liked one of the user’s comments)
  • Synchronizing comments from our cloud service back to a blogger’s personal database
  • Searching for comments

However, the protocols used to exchange data streams such as these are often proprietary and reinvented every time there is a new use case. What if we could come up with a generalized protocol to handle all of these use cases and more? Especially as social networking services begin to federate, it becomes important to have mechanisms for data stream synchronization that are both easy to understand and easy to implement, yet are also efficient and reliable. Smart Feeds aim to provide exactly this. The word “feed” is used here because the synchronization model resembles that of Atom and RSS feeds.

This specification provides an abstract definition of a Smart Feed that is format and transport neutral, followed by a more concrete protocol definition that utilizes HTTP, XMPP, JSON, and Atom.

Preliminaries

Principles

  • We want a general solution that covers many activity consumption use-cases.
  • It is not necessary to provide a change log in order to support synchronization.
  • Simplicity is more important than efficiency. This is a tenet of successful and widely adopted protocols.

Requirements

  • It must be possible for servers to operate statelessly. Subscription maintenance should be the extent of statefulness.
  • It must be possible to push data to interested consumers as it happens.
  • It must be possible for long polling to be used in a pull manner, meaning a client may request a range of items, and the server may either reply immediately or hold until new items arrive. A held long poll from a pull request is considered to be a pull action, not a push action.
  • It must be possible for a consumer to efficiently mirror a remote feed.
  • It must be possible to adapt the protocol to different formats (e.g. XML, JSON) and different transports (e.g. HTTP, XMPP).

As Abstract

A Smart Feed consists of an ordered list of items. Each item in the feed is a whole object that can stand on its own, and is not a change or delta referring to a previous object. If an item is updated, and the feed sorts items by modified time, then the whole item moves to the front of the feed. Consumers can perform synchronization by comparing new whole objects with existing cached copies, and doing whatever action they consider to be the “least wrong” given what they know about the feed. It’s really that simple, and very similar to how an RSS feed works. An RSS client, for example, will poll the same feed document over and over, but it will only present items to the user that it has not seen already. This concept should be much easier to digest, implement, and actually get right, than one based on a stream of deltas.

To say that something is a Smart Feed means it adheres to the following rules:

  • A feed presents a flat, non-hierarchical list of items with a consistent ordering.
  • Items are entire object values, not merely changes to objects.
  • All items have a unique id within a feed.
  • Feeds do not expire items.
  • Items are never truly deleted, but stripped of content and marked as deleted.
  • Different feeds may exist for the same data, each with their own orderings or filters.
  • Filtering means deciding whether an item should appear in a feed or not, or if an item should be transformed in some way before delivering to the consumer.
  • The same feed may be presented differently depending on who is consuming it, but only by filtering, not by reordering.

Additionally, there must be some way to actually interface with the feed. At minimum, a Smart Feed MUST offer a pull interface and SHOULD offer a push interface.

Pull Interface

A pull interface uses request/response interactions to allow a client to fetch a set of items from the feed based on range parameters. The following parameters are defined:

Parameter  Explanation
since      Limit response to items positionally after this position spec (see below).
until      Limit response to items positionally before this position spec (see below).
max        Limit response to this many items at most. If unspecified, the server SHOULD have a default maximum.

Not surprisingly, position specifications are used to specify positions in the feed. There are several types, and the recommended syntax is a string consisting of the type, a ‘:’ character, and a value specific to the type, concatenated together. For example, “id:1” would refer to the item with id 1.

The following position specs are defined:

Type    Explanation
cursor  Points to an item using a server-generated string token. This is a server-specific value based on item ordering, and unique by item. Clients MUST treat cursor values as opaque. Cursors SHOULD be long-lived (that is, they should work after long periods of time and not be associated with a particular “session”), and processing SHOULD gracefully degrade if the cursor format ever changes. A feed MUST support the cursor type.
id      Points to the item with this id. Support for this type is OPTIONAL, and only makes sense for feeds where items don’t change position.
time    Points to an item based on timestamp. Unlike other position specs, when the time type is used to retrieve items, the results MUST be inclusive (i.e. return not only items whose timestamps are after or before the given time, but also those whose timestamps are equal to it). Support for this type is OPTIONAL, and only makes sense for feeds ordered by time.
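
The “{type}:{value}” syntax can be handled with a few lines of code. The following is a minimal sketch in Python, not part of the specification; the names PositionSpec, parse_position_spec, and format_position_spec are hypothetical, and rejecting unknown types is an assumption made for illustration.

```python
from typing import NamedTuple

# The three position spec types defined in the table above.
KNOWN_TYPES = {"cursor", "id", "time"}


class PositionSpec(NamedTuple):
    type: str
    value: str


def parse_position_spec(spec: str) -> PositionSpec:
    """Split '{type}:{value}' at the first ':' only, since values may contain ':'."""
    type_, sep, value = spec.partition(":")
    if not sep or not value or type_ not in KNOWN_TYPES:
        raise ValueError(f"invalid position spec: {spec!r}")
    return PositionSpec(type_, value)


def format_position_spec(type_: str, value: str) -> str:
    """Join a type and a value back into the string form."""
    return f"{type_}:{value}"
```

Splitting at the first ‘:’ only matters for the time type, where a timestamp value may itself contain ‘:’ characters.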

Although every feed MUST support the since, until, and max parameters, feeds MAY define other parameters.

After receiving a pull request, the server responds with a list of items constrained to the limits specified in the request. A cursor value MUST be provided for the last item returned, and MAY be provided for other items.
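
To make the range semantics concrete, here is a rough sketch of a pull over an in-memory feed. It is an illustration under stated assumptions, not a reference implementation: the function name pull, the DEFAULT_MAX value, and the representation of the feed as a list of (cursor, item) pairs are all hypothetical, and since/until are taken as bare cursor values rather than full position specs.

```python
from typing import List, Optional, Tuple

DEFAULT_MAX = 50  # assumed default; the text only says a default SHOULD exist


def pull(items: List[Tuple[str, dict]],
         since: Optional[str] = None,
         until: Optional[str] = None,
         max_items: int = DEFAULT_MAX):
    """Return (page_items, last_cursor) for a feed given in its consistent order.

    `items` is a list of (cursor, item) pairs. `since` and `until` are
    exclusive bounds, matching "positionally after" and "positionally before".
    """
    cursors = [cursor for cursor, _ in items]
    start = cursors.index(since) + 1 if since is not None else 0
    end = cursors.index(until) if until is not None else len(items)
    page = items[start:end][:max_items]
    # A cursor MUST be provided for the last item returned.
    last_cursor = page[-1][0] if page else None
    return [item for _, item in page], last_cursor
```

An unknown cursor raises ValueError here; a real server would need a policy for that case, which this sketch does not attempt to define.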

Push Interface

A push interface is used to allow feed servers to notify clients whenever an item is added or has changed. It is essentially the equivalent of pulling from a feed that is ordered by modified time, although other behaviors are possible (e.g. the equivalent of pulling from a feed that is ordered by created time could be achieved if a push subscription were configured to only notify about newly created items). Push notifications MUST be delivered in order. Servers SHOULD use delivery mechanisms that have a high degree of success. Servers SHOULD NOT queue outbound notifications indefinitely, though, and this means delivery is not 100% guaranteed.

A cursor value SHOULD be provided for the last item in a push, and MAY be provided for other items. The push SHOULD also contain the cursor value of the last item in the feed prior to this update, known as the “previous” item. The intent is that this item is the one that would have immediately preceded the first item in the push. Be aware that this can only be relied upon if the subscription configuration and feed ordering match up. The previous item cursor, as well as the last item cursor, are needed to support synchronization.

Updates/Synchronization via Pull

With pull, synchronization is performed by fetching any items since the latest item received. The client should repeatedly poll a feed that is ordered by modified time, providing the cursor of the last item from the previous response as the since value.

Feeds expecting synchronization to occur completely through pull requests SHOULD offer a long polling interface, whereby if the server has no data to respond with then it should hold the request open until data becomes available.
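
The polling loop described above might be sketched as follows. This is only an outline under assumptions: catch_up, fetch, and apply are hypothetical names, and fetch stands in for whatever transport performs the pull (it could be a long-polling request that blocks until data arrives or a timeout elapses).

```python
from typing import Callable, List, Optional, Tuple

# A page is (items, cursor of the last item), or ([], None) when nothing is new.
Page = Tuple[List[dict], Optional[str]]


def catch_up(fetch: Callable[[Optional[str]], Page],
             since: Optional[str],
             apply: Callable[[dict], None]) -> Optional[str]:
    """Pull repeatedly from `since` until a page comes back empty.

    Returns the cursor to use as `since` on the next poll.
    """
    while True:
        items, last_cursor = fetch(since)
        if not items:
            return since
        for item in items:
            apply(item)  # compare with the cached copy, then store
        since = last_cursor
```

A client would call catch_up on a timer, or in a loop when the server supports long polling, and persist the returned cursor between calls.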

Updates/Synchronization via Push

Even though pushes are more efficient for delivering updates than pull requests, they are not as robust. If a synchronization solution is needed based around push, then pull requests must still be used periodically to ensure feed integrity. The strategy is as follows.

First, bootstrap the process:

  1. Subscribe to a feed that is ordered by modified time.
  2. Perform an initial pull request to fetch the most recent item. Let L be the cursor of the most recent item.

Then:

  • When a push is received, the client checks that the previous cursor matches L. If it matches, the items are processed, and L is set to the cursor of the last item received in the push. If it does not match, the client should perform a pull request since L, repeatedly if necessary, until the client is fully caught up. Note: if a client can determine that the push is old (for example, if the first item contained in the push has a modified time that is earlier than the latest known modified item), then it can simply be discarded and a followup pull is not needed.
  • If a long period of time passes and there have been no pushes, the client should perform a pull request since L, just to be sure nothing has been missed.
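
The strategy above can be sketched as a small client-side state machine. The class and method names (PushSyncClient, on_push, resync) and the pull_since callable are hypothetical, introduced only for illustration. The optimization of discarding old pushes is left out for brevity, and the idle check is assumed to be driven by an external timer that calls resync.

```python
class PushSyncClient:
    """Tracks L, the cursor of the most recent item the client has processed."""

    def __init__(self, pull_since, apply, last_cursor):
        self.pull_since = pull_since  # callable: cursor -> (items, last_cursor)
        self.apply = apply            # callable invoked once per item
        self.last = last_cursor       # "L" from the bootstrap pull

    def on_push(self, prev_cursor, items, last_cursor):
        if prev_cursor == self.last:
            # The push continues exactly where we left off.
            for item in items:
                self.apply(item)
            self.last = last_cursor
        else:
            # A gap (or reordering) was detected: fall back to pulling.
            self.resync()

    def resync(self):
        """Pull since L, repeatedly if necessary, until fully caught up."""
        while True:
            items, last_cursor = self.pull_since(self.last)
            if not items:
                break
            for item in items:
                self.apply(item)
            self.last = last_cursor
```

When a gap is detected, the items carried by the mismatched push are ignored, because the follow-up pull returns them anyway.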

In Reality

While any number of formats and transports could be used to meet the abstract requirements of a Smart Feed, this specification recommends the following HTTP-based and XMPP-based protocols be used when implementing and offering such feeds.

Position Specs

When providing position specs, the string syntax “{type}:{value}” MUST be used.

HTTP Pull

Items are fetched via HTTP GET. The query string MAY contain any of the following parameters:

Parameter  Explanation
since      See Pull Interface.
until      See Pull Interface.
max        See Pull Interface.
callback   JavaScript function to wrap the results in (for use with JSON-P).
timeout    Number of seconds to wait before returning, if data is not available (for use with long polling). Default 55 if unspecified. A timeout of 0 means return immediately.

The feed itself is referenced via the remainder of the URL, which can even include other query string parameters as long as they don’t conflict with any of the ones above. This is called the base URL. The base URL scheme must be defined by the feed (see Feed Requirements).

For example, suppose a feed is located at “http://example.com/articles/1/comments/?order=created”. A request to pull items might look like:

GET /articles/1/comments/?order=created&max=50 HTTP/1.1
Host: example.com
Accept: application/json
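
A client can construct such a request URL by appending the pull parameters to the base URL while preserving the feed’s own query string. The following sketch uses Python’s standard urllib.parse module; the helper name build_pull_url is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters reserved by the HTTP pull protocol.
RESERVED = {"since", "until", "max", "callback", "timeout"}


def build_pull_url(base_url: str, **params) -> str:
    """Append pull parameters to a feed's base URL, keeping its own query string."""
    unknown = set(params) - RESERVED
    if unknown:
        raise ValueError(f"not a pull parameter: {sorted(unknown)}")
    parts = urlsplit(base_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [(key, str(value)) for key, value in params.items() if value is not None]
    # urlencode percent-encodes the ':' in position specs as %3A.
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For instance, build_pull_url("http://example.com/articles/1/comments/?order=created", max=50) yields the URL requested in the example above.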

The HTTP response MUST be one of the following: a JSON ActivityStreams Collection object (content type “application/json”), an Atom XML document (content type “application/atom+xml”), or JavaScript (content type “application/javascript”). The last is for JSON-P, where the response is ActivityStreams or Atom wrapped in code. Even if JSON-P is indicated with the callback parameter, the Accept header can still be used to select between ActivityStreams and Atom.

For an ActivityStreams response, items are contained within a Collection object. Set count to the number of items in this response, totalItems to the number of items in the feed, and url to the base URL of the feed. Extension field last_cursor is set to the cursor value of the last item returned. Extension field next MAY be provided, which contains a fully constructed URL to the next page of the feed. For example:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
    "count": 50,
    "totalItems": 427,
    "url": "http://example.com/articles/1/comments/?order=created",
    "last_cursor": "8f4a1546",
    "next": "http://example.com/articles/1/comments/?order=created&since=cursor%3A8f4a1546&max=50",
    "items": [
        ...
        {
            ...
            "id": "b204edf6-7a03-11e0-8617-837f7e2798e8"
            ...
        }
    ]
}
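
On the client side, reading such a response might look like the sketch below. The function name read_collection is hypothetical, and the fallback that builds a next-page URL from url and last_cursor when next is absent is an assumption about reasonable client behavior, not something the specification mandates.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def read_collection(body: str):
    """Return (items, last_cursor, next_url) from an ActivityStreams response body."""
    doc = json.loads(body)
    items = doc.get("items", [])
    last_cursor = doc.get("last_cursor")
    next_url = doc.get("next")  # optional extension field
    if next_url is None and last_cursor is not None:
        # Assumed fallback: derive the next page from the base URL and cursor.
        parts = urlsplit(doc["url"])
        query = parse_qsl(parts.query) + [("since", f"cursor:{last_cursor}")]
        next_url = urlunsplit(parts._replace(query=urlencode(query)))
    return items, last_cursor, next_url
```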

For an Atom response, items are contained within a feed document. Set the element total (in the http://fanout.org/protocol/atom namespace) to the number of items in the feed. Set the element last_cursor (in the http://fanout.org/protocol/atom namespace) to the cursor value of the last item returned. Set the document’s rel="self" link to the base URL of the feed. A rel="next" link in the document MAY be provided, which contains a fully constructed URL to the next page of the feed. Each entry also MUST have a fo:id value. The Atom entry id MAY match fo:id or be derived from it (e.g. to make it a URI). Example response:

HTTP/1.1 200 OK
Content-Type: application/atom+xml; charset=utf-8

<?xml version="1.0" encoding="UTF-8"?>
<feed
    xmlns="http://www.w3.org/2005/Atom"
    xmlns:fo="http://fanout.org/protocol/atom">
  <link
      rel="self"
      type="application/atom+xml"
      href="http://example.com/articles/1/comments/?order=created"/>
  <link
      rel="next"
      type="application/atom+xml"
      href="http://example.com/articles/1/comments/?order=created&amp;since=cursor%3A8f4a1546&amp;max=50"/>
  <fo:total>427</fo:total>
  <fo:last_cursor>8f4a1546</fo:last_cursor>
  ...
  <entry>
    ...
    <fo:id>b204edf6-7a03-11e0-8617-837f7e2798e8</fo:id>
    ...
  </entry>
</feed>

Note: there is no equivalent to the count value from ActivityStreams, but it is not very useful anyway.
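
A client could extract the same information from an Atom response with a namespace-aware XML parser. The sketch below uses Python’s standard xml.etree.ElementTree; the function name read_atom_feed and the shape of the returned dictionary are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "fo": "http://fanout.org/protocol/atom",
}


def read_atom_feed(xml_bytes: bytes) -> dict:
    """Extract the Smart Feed fields from an Atom feed document."""
    root = ET.fromstring(xml_bytes)
    total = root.findtext("fo:total", namespaces=NS)
    links = {link.get("rel"): link.get("href")
             for link in root.findall("atom:link", NS)}
    return {
        "total": int(total) if total is not None else None,
        "last_cursor": root.findtext("fo:last_cursor", namespaces=NS),
        "self": links.get("self"),
        "next": links.get("next"),  # None when the server omits the link
        "ids": [entry.findtext("fo:id", namespaces=NS)
                for entry in root.findall("atom:entry", NS)],
    }
```

Because the parser decodes entity references, an href written with “&amp;amp;” in the document comes back with plain “&” separators, ready to be requested.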

HTTP Push

PubSubHubbub subscriptions are used to register URLs to be called when there are updates to the feed. Support for this is OPTIONAL. If supported, the hub URL is defined by the feed specification. The topic URL is the same as the feed’s base URL (normally used for pulling), though the query string parameters may differ depending on configuration. For example, the hub parameters might be set as such:

Parameter     Value
hub.callback  http://mysite.example.com/push-handler
hub.mode      subscribe
hub.topic     http://example.com/articles/1/comments/?order=created
hub.verify    async

Here’s an example of a subscription request:

POST /articles/1/comments/hub/?order=created HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded

hub.callback=http%3A%2F%2Fmysite.example.com%2Fpush-handler&hub.
mode=subscribe&hub.topic=http%3A%2F%2Fexample.com%2Farticles%2F1
%2Fcomments%2F%3Forder%3Dcreated&hub.verify=async
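
The form-encoded body shown above (wrapped across lines for display) can be produced with a standard URL-encoding routine. A minimal sketch using Python’s urllib.parse follows; the function name subscription_body is hypothetical.

```python
from urllib.parse import urlencode


def subscription_body(callback: str, topic: str,
                      mode: str = "subscribe", verify: str = "async") -> str:
    """Build the application/x-www-form-urlencoded body of a subscription request."""
    return urlencode([
        ("hub.callback", callback),
        ("hub.mode", mode),
        ("hub.topic", topic),
        ("hub.verify", verify),
    ])
```

The result is a single line with no embedded line breaks, which is what should be sent as the request body.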

The server will then perform a verification request to the client’s callback URL, per the PubSubHubbub protocol.

Feed updates are pushed using normal content distribution, using an Atom feed document containing one or more items as entries. The document SHOULD contain the fo:total value, and SHOULD NOT contain a rel="next" link. Additionally, it SHOULD contain an fo:prev_cursor value, set to the cursor value of the last item in the feed prior to the feed update, and SHOULD contain an fo:last_cursor value, set to the cursor value of the last item returned.

Example of update notification:

POST /push-handler HTTP/1.1
Host: mysite.example.com
Content-Type: application/atom+xml

<?xml version="1.0" encoding="UTF-8"?>
<feed
    xmlns="http://www.w3.org/2005/Atom"
    xmlns:fo="http://fanout.org/protocol/atom">
  <link
      rel="self"
      type="application/atom+xml"
      href="http://example.com/articles/1/comments/?order=created"/>
  <fo:total>427</fo:total>
  <fo:prev_cursor>f4ac5920</fo:prev_cursor>
  <fo:last_cursor>64fc10d2</fo:last_cursor>
  ...
  <entry>
    ...
    <fo:id>14f687b2-7a04-11e0-bc61-9f37d1b52ea7</fo:id>
    ...
  </entry>
</feed>

Note: PubSubHubbub is used as a subscription and delivery protocol only. It is not used for receiving publish requests, as this is implied by the feed, nor is the hub expected to be useful as a generic hub capable of tracking remote feeds.

XMPP Pull/Push

The Publish-Subscribe (XEP-0060) and Result Set Management (XEP-0059) protocols are used to provide pull and push access. The JID and node to use must be defined by the feed (see Feed Requirements). If the feed defines additional parameters, they are to be appended to the node name in a query string-like format. If the node name scheme already uses a query string-like format, then that space should be shared. Position specs are used for the Result Set Management UIDs. The server always replies with UIDs using the cursor type. The client can submit a UID containing any type (e.g. id or time). Items are always formatted as Atom entries. The Atom entry id MAY match the item id or be derived from it (e.g. to make it a URI).

For pull requests, the desired value for since is passed as the RSM after value, the desired value for until is passed as the RSM before value, and the desired value for max is passed as the RSM max value.
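
This parameter mapping can be sketched as a helper that builds the RSM set element. The function name rsm_set is hypothetical, and the sketch only constructs the element; wrapping it in the pubsub IQ stanza and serializing it is left to the XMPP library in use.

```python
import xml.etree.ElementTree as ET

RSM_NS = "http://jabber.org/protocol/rsm"


def rsm_set(since=None, until=None, max_items=None) -> ET.Element:
    """Map the abstract pull parameters onto an RSM <set/> element."""
    set_el = ET.Element(f"{{{RSM_NS}}}set")
    mapping = (("max", max_items), ("after", since), ("before", until))
    for tag, value in mapping:
        if value is not None:
            ET.SubElement(set_el, f"{{{RSM_NS}}}{tag}").text = str(value)
    return set_el
```

The since and until arguments are expected to be full position specs such as “cursor:8f4a1546”, since position specs are used as the RSM UIDs.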

Here’s an example pull request:

<iq type="get" from="alice@example.net/1" to="article_1@example.com" id="1">
  <pubsub xmlns="http://jabber.org/protocol/pubsub">
    <items node="comments?order=created"/>
    <set xmlns="http://jabber.org/protocol/rsm">
      <max>50</max>
    </set>
  </pubsub>
</iq>

The server responds:

<iq type="result" from="article_1@example.com" to="alice@example.net/1" id="1">
  <pubsub xmlns="http://jabber.org/protocol/pubsub">
    <items node="comments?order=created">
      ...
      <item id="b204edf6-7a03-11e0-8617-837f7e2798e8">
        <entry xmlns="http://www.w3.org/2005/Atom">
          ...
        </entry>
      </item>
    </items>
    <set xmlns="http://jabber.org/protocol/rsm">
      <first index="0">cursor:cdc811df</first>
      <last>cursor:8f4a1546</last>
      <count>427</count>
    </set>
  </pubsub>
</iq>

Cursors MUST be provided in the RSM first and last values. The RSM count contains the total items in the feed.

Requesting the next page:

<iq type="get" from="alice@example.net/1" to="article_1@example.com" id="2">
  <pubsub xmlns="http://jabber.org/protocol/pubsub">
    <items node="comments?order=created"/>
    <set xmlns="http://jabber.org/protocol/rsm">
      <max>50</max>
      <after>cursor:8f4a1546</after>
    </set>
  </pubsub>
</iq>

For push, an entity subscribes to the node (which may contain dynamic parameters). For example:

<iq type="set" from="alice@example.net/1" to="article_1@example.com" id="3">
  <pubsub xmlns="http://jabber.org/protocol/pubsub">
    <subscribe node="comments?order=created" jid="alice@example.net/1"/>
  </pubsub>
</iq>

While “order” makes little sense in the context of push, the “?order=created” portion of the node name could mean to only push out notifications for newly added items, which would give it similar behavior to a feed ordered by created time. Any such meaning is defined by the feed specification.

Once subscribed, the server can push message notifications. Messages SHOULD contain a header (using Stanza Headers and Internet Metadata (XEP-0131)) called “TotalItems” set to the number of items in the feed, a header called “PreviousCursor” set to the cursor value of the last item in the feed prior to the feed update, and a header called “LastCursor” set to the cursor value of the last item returned. Here’s an example message:

<message from="article_1@example.com" to="alice@example.net/1" type="headline">
  <event xmlns="http://jabber.org/protocol/pubsub#event">
    <items node="comments?order=created">
      <item id="14f687b2-7a04-11e0-bc61-9f37d1b52ea7">
        <entry xmlns="http://www.w3.org/2005/Atom">
          ...
        </entry>
      </item>
    </items>
  </event>
  <headers xmlns="http://jabber.org/protocol/shim">
    <header name="TotalItems">427</header>
    <header name="PreviousCursor">f4ac5920</header>
    <header name="LastCursor">64fc10d2</header>
  </headers>
</message>
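
A receiving client needs the three header values to run the push synchronization strategy. The sketch below reads them from a message stanza using Python’s standard XML parser; read_push_headers is a hypothetical name, and a real client would normally receive an already-parsed stanza from its XMPP library rather than raw text.

```python
import xml.etree.ElementTree as ET

SHIM_NS = "http://jabber.org/protocol/shim"


def read_push_headers(message_xml: str) -> dict:
    """Collect the Smart Feed SHIM headers from a push notification message."""
    root = ET.fromstring(message_xml)
    headers = {header.get("name"): header.text
               for header in root.iter(f"{{{SHIM_NS}}}header")}
    total = headers.get("TotalItems")
    return {
        "total": int(total) if total else None,
        "prev_cursor": headers.get("PreviousCursor"),
        "last_cursor": headers.get("LastCursor"),
    }
```

Since the headers are only a SHOULD, any of the returned values may be None, and a client that finds no previous cursor cannot verify continuity and would have to fall back to a pull.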

Cursors

Cursor values are used to refer to positions in a feed that should remain usable even if items change position. This specification makes no mandate as to how the cursors should be encoded in implementations, but it is RECOMMENDED to use one of the following approaches:

  1. If the server maintains a version log, then version numbers/ids could be used as cursor values.
  2. If the server does not maintain a version log, then the T+O+C encoding (described below) should be used.

The T+O+C encoding (timestamp, offset, checksum) is a simple way to encode both the position of an item as well as the integrity of that position. In the vast majority of situations, the feed will not be modified in a way that would cause subsequent pull requests to get misaligned and lose updates. However, in the rare case that the feed is modified in such a way, then the server can confirm this by checking the T+O+C integrity and falling back to a timestamp based query if the check doesn’t pass. This fallback query may return items that the client has already seen, but only as a one time effect. Therefore, T+O+C is optimized for the common case, with a recovery approach for edge cases.

A T+O+C cursor for an item is a string of the format “{T}_{O}_{C}” where T is the modified time of the item, O is the position of the item relative to the first item with the same modified time, and C is a checksum of all of the item ids with the same modified time from position 0 to the position of this item. The checksum algorithm is CRC32, and the input is a UTF-8 string of item ids joined by ‘_’ (underscore) characters. Please note that for this to work, the server MUST have a deterministic ordering of items that share the same timestamp.

When a server receives a T+O+C cursor, it recalculates the checksum based on the T and O values, and if the result does not match C, then it should treat the position spec as if it were “time:{T}”.
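
The encoding and the integrity check can be sketched as follows. Several details are assumptions, since the specification does not fix them: T is taken to be an integer timestamp, C is rendered as eight lowercase hexadecimal digits, and the feed is modeled as a list of (modified time, item id) pairs already in the server’s deterministic order. The names make_cursor and resolve_cursor are hypothetical.

```python
import zlib
from typing import List, Tuple

Item = Tuple[int, str]  # (modified time, item id), in feed order


def _checksum(ids: List[str]) -> str:
    """CRC32 over the UTF-8 encoding of the ids joined by '_'."""
    return format(zlib.crc32("_".join(ids).encode("utf-8")), "08x")


def _ids_at(feed: List[Item], t: int) -> List[str]:
    return [item_id for mtime, item_id in feed if mtime == t]


def make_cursor(feed: List[Item], index: int) -> str:
    """Build '{T}_{O}_{C}' for the item at `index`."""
    t, _ = feed[index]
    # O: how many earlier items share this item's modified time.
    offset = sum(1 for mtime, _ in feed[:index] if mtime == t)
    same_time_ids = _ids_at(feed, t)
    return f"{t}_{offset}_{_checksum(same_time_ids[:offset + 1])}"


def resolve_cursor(feed: List[Item], cursor: str) -> str:
    """Return the position spec the server should act on for this cursor."""
    t_str, o_str, checksum = cursor.split("_")
    t, offset = int(t_str), int(o_str)
    same_time_ids = _ids_at(feed, t)
    if offset < len(same_time_ids) and _checksum(same_time_ids[:offset + 1]) == checksum:
        return f"cursor:{cursor}"  # position is intact
    return f"time:{t}"             # integrity check failed: fall back
```

When the check fails, the time-based fallback is inclusive, so the client may see items with modified time T again; this is the one-time duplication the text describes.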

Security Considerations

Authentication is delegated to the underlying transport (HTTP or XMPP).

Implementations SHOULD restrict access to (or filter) feeds and items based on authorization of the entity performing pulls or receiving pushes.

Feed Requirements

To qualify as a Smart Feed, a feed MUST supply the following information:

  1. One or more access methods:
    • Base HTTP URL for JSON ActivityStreams results
    • Base HTTP URL for Atom results (may be the same as the former, using Accept header)
    • XMPP JID and PubSub node
  2. The supported position specs.
  3. Item format (must be ActivityStreams or Atom compatible, depending on the access methods offered).