<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Discovery / Autoconf</title>
<h2>TCF UDP Discovery</h2>
Designed by Eugene Tarassov (firstname.lastname@example.org)<br>
Documented by John Cortell (email@example.com)<br>
Created on: December 5, 2009<br>
Last modified on: June 03, 2011<br>
TCF Discovery for agents connected to a network is done via UDP. By default, a static, well-known (to TCF agents) port is used: 1534.
However, we will see that agents on a machine can participate in discovery even if that port is unavailable; all that is required is
that at least <i>one</i> agent on the network is listening on the well-known port.
Fundamentally, TCF discovery is based on agents advertising their peers by sending UDP packets to other agents, both proactively and reactively.
This concept is simple enough when thinking in terms of one agent per machine. However, there can be multiple agents per machine and that
significantly complicates the discovery mechanism as there can only be one process listening on a given port for a given protocol (UDP or TCP).
That means only one agent per machine (NIC, actually) can service discovery requests on the well known port. How do other agents running on that
machine get in the game? The TCF discovery mechanism addresses this using the notion of master/slave. Basically, the first agent to acquire
the well-known port (1354) becomes the master. All other agents on that machine become slaves.
The master/slave relationship in TCF discovery is not what one might intuitively think, though. Masters and slaves are not all that different
from each other. A slave is not exclusively reliant on the master on its machine. It can rely on any master on the network (subnet, actually;
more on this later). Moreover, a slave's reliance on masters is limited. Ultimately a slave agent ends up establishing a <i>direct</i> relationship
with every other slave on the network, and of course has access to the masters through UDP broadcasting (to port 1354). The role of the master is
simply to make it possible for a new-comer slave to initially meet other slaves. Ultimately, every slave meets every other slave, and they all
directly deal with each other thereafter.
This approach may seem a bit strange. A simpler and more intuitive approach would be to have the master on a machine be the representative
for the slaves on that machine. A slave would just communicate locally with its master through UDP unicasting, and the masters would
communicate among themselves through UDP broadcasting, proliferating information about themselves and their slaves. However, such an approach
would make vendors too vulnerable because the master agent is simply the first agent that is able to grab the well-known TCF discovery port.
There are no guarantees that the agent will turn out to be a well-behaved one. What if its discovery logic is buggy? What if it hangs?
What if it supports only version X of the discovery protocol but other agents on that machine rely on version Y?
Any of these scenarios could render other vendors' agents useless on a machine. Since the TCF discovery mechanism does not employ dedicated
reliable masters but rather relies on every agent volunteering for the role, then the mechanism must be robust enough to overcome
poorly-behaved volunteers. That is accomplished by limiting the role of the master and having all slaves know about, and deal with,
all other slaves directly. Slaves do not need to keep track of masters explicitly; an agent can reach any and all masters by broadcasting
to the well-known port (1354). The end result is that a single master agent can make discovery possible for the entire network.
TCF UDP Discovery is limited to IP subnet scope by design. What that means is that an agent will discover only the peers available on
the subnet(s) its machine is on. If an agent is running on a machine with two NICs, each on a seperate subnet, then that agent will discover
peers on both subnets. However, such an agent will not act as a bridge for discovery across the two subnets. That is, if machine A is on subnet X only,
machine B is on subnets X and Y, and machine C is on subnet Y only, agents on B will know the available peers on subnets X and Y,
but agents on A will not know about peers on Y and agents on C will not know about peers on X.
For simplicity, further discussion on TCF discovery will just refer to agents "on the network", but keep in mind that subnet scoping is implied.
<h3>Rules of Engagement</h3>
Fundamental to the discovery mechanism is the notion that agents must work together to keep each other in sync regarding available agents and peers.
More specifically, agents must be <i>inquisitive, responsive </i>and <i>proactive</i> in sharing its knowledge of peers and agents.
An agent is <i>inquisitive</i> by asking other agents for their tables. It is <i>responsive</i> by providing their own tables when asked.
Finally, it is <i>proactive </i>by volunteering that information even when not asked. This last element is necessary in order to quickly proliferate
new information, without requiring constant polling by agents.
Of course, all this activity back and forth between many agents could easily get out of hand without having well designed <i>rules of engagement</i>.
A reasonable balance of all three activities must occur. Too much interaction will lead to excessive network traffic and agent load.
Too little will prolong stale/inaccurate information. The rules of engagement will be discussed shortly.
The final element of discovery is validation. As there are no notifications in TCF discovery for a peer or agent going away, agents must regularly
validate their tables for accuracy and prune accordingly. This is also covered in the rules of engagement.
The following are the TCF UDP Discovey <i>Rules of Engagement. </i>Unless explicitly qualified, "agent" refers to both masters and slaves:
Agents should, on launch:
broadcast a request for peers to all master agents
broadcast descriptions of its peers to all master agents
Upon receiving a packet (any) from an agent, the receiving agent should ask the sender for its slave table, but should make such a request
only occasionally for any given subnet. In other words, an agent should periodically ask another agent (any agent) for its slaves.
Once an agent receives a CONF_REQ_SLAVES from a slave, for duration of X seconds (retention period, see rule 5), the agent should thereafter
provide that slave its slave table with every subsequent CONF_REQ_INFO it receives from that slave, and it should also proactively provide that
slave info about all new discovered slaves during that time. Essentially, by sending CONF_REQ_SLAVES (rule #2), a slave creates
a temporary coupling between it and that agent (usually a master), where the agent (master) proactively updates the slave's slave table.
An agent should perpetually and periodically notify all known agents with the collection of peers it makes available. Master agents should additionally,
but selectively, advertise peers from other agents. A master should advertise peers from local slaves to all known agents. A master should advertise
peers from remote agents to local slaves via the loopback address. The main purpose of these additional responsibilities is for discovery to work in
the presence of a firewall, as long as the firewall permits traffic through the well-known port (1354). A good interval for the notifications is 1/4
the <i>retention </i>period (see rule 5). At each interval, an agent should ensure that it has sent at least one UDP packet. If it has nothing to advertise,
it should send a CONF_SLAVES_INFO with no data (just the header). Failure to regularly send something may result in the agent being forgotten by the other agents.
A TCF agent can have nothing to contribute, but still be a consumer of discovery data. A slave agent will receive updated discovery data only if it sends something.
A master agent must actively be involved in discovery; it needs to make its presence known to slaves, but slaves will see it only if it sends something.
Agents should have short term memory when it comes to its knowledge of other slaves and and peers. If it hasn't received a UDP packet from a slave in
X amount of time, it should forget that slave. X is called the <i>retention period. </i>60 seconds is a good, default value. Similarly, if it hasn't
received a CONF_PEER_INFO packet for a peer in X amount of time, and it doesn't have an open channel to that peer (not actively using it),
then it should dispose and forget that peer.
If a slave agent has not received a packet from the local master in a while, then it should attempt to become the master by opening the well-known
port (1534). By definition, successfully opening that port makes that agent the master on that machine.
An agent is introduced to other slaves either by receiving another agent's slave table (CONF_SLAVES_INFO) or by receiving a packet directly from the slave.
When an agent becomes aware of a new slave, it should immediately:
ask the new slave for its peers
tell the new slave about all known peers and slaves
advertise the new slave to all slaves that have sent CONF_REQ_SLAVES to this agent within the last X seconds (retention period)
On shutdown, agent may broadcast CONF_PEERS_REMOVED to all local networks and known slaves.
All TCF discovery UDP packets have an eight byte header:
bytes 0, 1 and 2 are 'T', 'C', and 'F', respectively
byte 3 is the discovery protocol version, currently 2, in ASCII
byte 4 is the packet type
1 = CONF_REQ_INFO
2 = CONF_PEER_INFO
3 = CONF_REQ_SLAVES
4 = CONF_SLAVES_INFO<br>
5 = CONF_PEERS_REMOVED<br>
bytes 5-7 are reserved for future use<br>
All additional data is packet-type specific. All strings in that data are UTF-8 encoded.
This is a request for peers info. A recipient of this request should respond by sending back (to whomever sent the request) a CONF_PEER_INFO for every peer
it wishes to advertise. This packet has no additional bytes besides the header.
This is a packet which advertises a peer. An agent will send this reactively in response to a CONF_REQ_INFO request or proactively as dictated by the rules
of engagement. Following the eight-byte header, this packet contains data that describes one and only one peer. The description is simply the attributes of
the peer (IPeer.getAttributes()), in the format <i>key</i>=<i>value</i>, with each attribute terminated by a zero byte.
This is a request for slaves info. A recipient of this request should respond by sending back (to whomever sent the request) one or more CONF_SLAVES_INFO
packets to relay the recipient's slave table. This packet has no additional bytes besides the header.
This is a packet which advertises one or more slaves. An agent will send this reactively in response to a CONF_REQ_SLAVES request or proactively as dictated
by the rules of engagement. Following the eight-byte header, this packet contains data that describes one or more agents. The format of an agent description is:
<i>ttl</i> - time to live - indicates milliseconds until the data expiration time. This number must be in decimal format.<br>
<i>port</i> - the port the agent is listening on (for TCF discovery). This number must be in decimal format.<br>
<i>host</i> - the IP address or host name the agent is listening on.<br>
<i>zero-byte</i> - single byte with value 0.<br>
This indicates knowledge of a TCF slave agent running at 192.168.0.1, listening on port 1940, and that the data should be cached for no more then 30 seconds.<br>
Multiple such descriptions can appear in a CONF_SLAVES_INFO packet. Agents should take care to not send excessively large UDP packets, though.
The TCF reference implementation keeps packets under 1500 bytes. Agents should split large tables across multiple UDP packets.
This is a packet which notifies removal of peers. When an agent is about to exit, it may send the packet to notify others about removing of peers that the agent implements.
This allows other agents to remove stale peer info right away instead of waiting until the data become expired. Following the eight-byte header, this packet contains
sequence of zero-terminated peer IDs.
Frequency of discovery activity is just as important as the rules of engagement but are harder to define. The sweet spot for any particular proactive
operation is subject to agent and network factors. We will here list values used by the Java TCF reference implementation hosted in the Eclipse TCF project.
This is the maximum amount of time an external peer and slave should be remembered barring additional confirmation of existence. 60 seconds.<br>
<i>Temporary Master-Slave Coupling</i><br>
By sending CONF_REQ_SLAVES (rule #2), a slave S creates a <i>temporary</i> coupling between it and the receiving agent A (usually a master),
where A becomes responsible for proactively updating S's slave table. This temporary coupling lasts X seconds, where X = retention period (60 seconds)<br>
Agents should proactively send peer info packets (see rules of engagement 4). It should do so periodically, and regardless of incoming activity
(as opposed to being triggered by incoming packets). It should at the same time update its subnet list, as users on a machine can
enable/disable/reconfigure NICs on the fly. Finally, it should perform housekeeping on its peer and slave table. 1/4 retention period (15 seconds).<br>
An agent that has not received a packet from the local master agent for more than 1/2 retention period (30 seconds) should attempt
to open the well-known port and become the master. It should do this check during its <i>periodic activity.</i><br>
<i>Slave Table Requests</i><br>
An agent should occasionally take an incoming packet as an opportunity to ask <i>any</i> other agent on the network for its slave table
(see rule of engagement 2). The minimum wait time on making such a request depends on the nature of the agent who's packet was just received:<br>
agent is a slave (on this or another machine): 2/3 retention period (40 seconds)
agent is the master of another machine: 1/2 retention period (30 seconds)
agent is the master of this machine: 1/3 retention period (20 seconds)
Basically, slaves should be engaged less often than masters, and remote masters less than local masters.<br>
<h3>Miscellaneous Implementation Considerations</h3>
1. Because a master agent not only responds to peer requests (on port 1354) but broadcasts them (to port 1354), it will receive its own requests.
An agent should check and ignore any packets it has sent.
2. The TCF C implementation of the locator service uses slightly different symbolic names for the packet types:
<table border=1 bordercolor=#000000 cellpadding=3 cellspacing=0 height=117 id=tfq3 width=317>
<th style=TEXT-ALIGN:center width=50%>
<th style=TEXT-ALIGN:center width=50%>