Monday, September 11, 2006

Where have I been

It's been a long time since my last research post. I haven't taken the time to write about what I'm doing because I've actually been doing it. Over the past year and a half, I've been exploring my research options here at Pitt and settling on an area. I attempted to move the distributed storage project, which used Netnice, to the PlanetLab testbed, which meant I had to develop my own simple quality-of-service implementation for the system. I then, as part of a class, got involved in a file distribution system, which I'm still planning to work on "in the background" with time I don't have.

More recently, though, I've been involved in wireless networks. I evaluated some routing protocols as part of a class, and I've just been getting into a Delay Tolerant Networking approach over the summer.

Thursday, March 16, 2006

Bandwidth Management Scheme

Normal Operation

For any one particular node in the system, the bandwidth is managed in a fairly simple manner. For each incoming connection, the node fairly allocates as much bandwidth to the connection as it can; that is, the total capacity of the node's link is divided equally among the open connections.
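
A minimal sketch of this fair-share partition (the Connection record and its field names are placeholders for illustration, not part of the prototype):

```cpp
#include <vector>

// Hypothetical per-connection record; only the assigned rate matters here.
struct Connection {
    double allowed_bps = 0.0;  // bandwidth currently granted to this connection
};

// Divide the node's link capacity equally among all open connections.
void partition_fairly(std::vector<Connection>& conns, double link_capacity_bps) {
    if (conns.empty()) return;
    const double share = link_capacity_bps / static_cast<double>(conns.size());
    for (Connection& c : conns) {
        c.allowed_bps = share;
    }
}
```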

Update and Challenges

When the backup interval elapses, signaling that an update of the node's data needs to occur, the node computes the minimum amount of bandwidth needed for all the open connections from remote nodes. Each of these connections is then throttled down to its minimum, and the rest of the bandwidth is dedicated to sending out the updated blocks or downloading the requested blocks. Once the update or challenge completes, the existing incoming connections are repartitioned to a fair share of the total bandwidth, provided the other event (challenge or update) is not also in progress.
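
A sketch of that throttling step, under the same placeholder names as above; it pins every open connection to its SLA minimum and returns whatever capacity is left for the update or challenge traffic:

```cpp
#include <algorithm>
#include <vector>

struct Connection {
    double min_bps = 0.0;      // minimum bandwidth promised in the SLA
    double allowed_bps = 0.0;  // bandwidth currently granted
};

// During an update or challenge: throttle each incoming connection to its
// minimum and return the remaining capacity for the update/challenge traffic.
double throttle_for_update(std::vector<Connection>& conns, double link_capacity_bps) {
    double reserved = 0.0;
    for (Connection& c : conns) {
        c.allowed_bps = c.min_bps;
        reserved += c.min_bps;
    }
    return std::max(0.0, link_capacity_bps - reserved);
}
```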

Assumptions and Simplifications

For this to be accurate, each of the incoming connections must have the same minimum bandwidth requirement. If they did not, then as the number of connections grew and a fair proportion was divided among each of them, the node would violate the agreement of the connection with the largest minimum requirement first, when it could instead partition the bandwidth unequally in such a manner that each connection stayed at or above its minimum.

Presumably, when an update occurs, the outgoing connections could be shaped by the remote nodes in such a way that the total capacity of the node's link is not being utilized. In this case, the node should dedicate more bandwidth to the incoming connections rather than just keeping them at their minimum requirements until the update finishes.

Monday, March 13, 2006

General Prototype Design

Main

The main thread of the prototype initializes the program, creates a listening socket, and loops forever, waiting for connections and timing events. The loop first calculates how much time it can wait before an event must occur. It then calls select() to poll the listening socket for the specified amount of time, which blocks until one of two events occurs: a connection is available to accept or a timeout is reached. If a connection is available, the main thread accepts the connection, creates a PeerConnection object around the socket, and spawns a new thread for the PeerConnection object to process the communication over the socket.

If a timeout occurs, there are two cases the main thread must handle. The first occurs when the timeout means an update interval was reached. In this case, the main thread spawns another thread calling the update() method, which simulates performing an update by creating objects of, and spawning threads for, the UpdateConnection class. The second case occurs when a challenge session time is reached. In this case, a similar scenario plays out: the main thread spawns another thread calling the restore() method, which itself spawns multiple threads, each servicing a RestoreConnection object that handles the challenge communication with one particular partner.
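
A condensed sketch of what such a main loop might look like with POSIX sockets and C++ threads; the port, the intervals, and the handle_peer()/update()/restore() stand-ins are all assumptions for illustration, not the prototype's actual code:

```cpp
#include <algorithm>
#include <chrono>
#include <thread>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

void handle_peer(int sock) { close(sock); }   // PeerConnection stand-in
void update()  {}                             // UpdateConnection driver stand-in
void restore() {}                             // RestoreConnection driver stand-in

int main() {
    using clock = std::chrono::steady_clock;
    const auto update_interval  = std::chrono::seconds(60);    // placeholder
    const auto restore_interval = std::chrono::seconds(300);   // placeholder
    auto next_update  = clock::now() + update_interval;
    auto next_restore = clock::now() + restore_interval;

    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);                               // placeholder port
    bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listener, 16);

    for (;;) {
        // Wait no longer than the time until the next scheduled event.
        auto now  = clock::now();
        auto wait = std::max(std::chrono::milliseconds(0),
            std::chrono::duration_cast<std::chrono::milliseconds>(
                std::min(next_update, next_restore) - now));
        timeval tv{static_cast<long>(wait.count() / 1000),
                   static_cast<long>((wait.count() % 1000) * 1000)};

        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(listener, &readfds);
        int ready = select(listener + 1, &readfds, nullptr, nullptr, &tv);

        // Case 1: a connection is available to accept.
        if (ready > 0 && FD_ISSET(listener, &readfds)) {
            int sock = accept(listener, nullptr, nullptr);
            if (sock >= 0)
                std::thread(handle_peer, sock).detach();  // one thread per PeerConnection
        }

        // Case 2: a timing event was reached.
        now = clock::now();
        if (now >= next_update)  { std::thread(update).detach();  next_update  = now + update_interval; }
        if (now >= next_restore) { std::thread(restore).detach(); next_restore = now + restore_interval; }
    }
}
```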

Bandwidth Management

All of the bandwidth accounting is done in the main(), update(), and restore() functions. While not described below, each spawned connection thread measures its own bandwidth usage and stores the value in a mutex-protected shared variable. These three functions then compute the total across all their spawned connections in order to determine network utilization. The scheme for this is not yet fully implemented.
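
One possible shape for that mutex-protected shared counter (the class and method names are illustrative, not the prototype's actual interface):

```cpp
#include <cstddef>
#include <mutex>

// Shared byte counter of the kind each connection thread could update; main(),
// update(), and restore() read and reset it to compute network utilization.
class BandwidthAccounting {
public:
    void add_bytes(std::size_t n) {
        std::lock_guard<std::mutex> lock(mu_);
        bytes_ += n;
    }
    // Return the bytes counted since the last call and start a new period.
    std::size_t take_total() {
        std::lock_guard<std::mutex> lock(mu_);
        std::size_t total = bytes_;
        bytes_ = 0;
        return total;
    }
private:
    std::mutex mu_;
    std::size_t bytes_ = 0;
};
```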

PeerConnection

Each PeerConnection is associated with a single accepted socket. The object loops indefinitely, waiting to read data from the socket. The remote host is either trying to store data or retrieve it, so the PeerConnection first reads a single byte that determines which case it should expect. If the remote host wishes to store a block, the PeerConnection loops until it has read in an entire block, which is then "stored to disk," and by that I mean discarded. If the remote host is asking for a block, the PeerConnection generates a block of data and loops until the entire 4KB block has been sent. It then returns to block on a recv() until another byte of data can be read from the socket. When the socket is closed, the PeerConnection is terminated.
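
A sketch of how such a PeerConnection loop might look; the opcode byte values 'S' and 'R' are assumptions, since the post does not specify the wire format:

```cpp
#include <cstddef>
#include <sys/socket.h>
#include <unistd.h>

constexpr std::size_t BLOCK_SIZE = 4096;

void peer_connection(int sock) {
    char op;
    while (recv(sock, &op, 1, 0) == 1) {            // block until the next request byte
        char block[BLOCK_SIZE];
        if (op == 'S') {                            // remote host stores a block
            std::size_t got = 0;
            while (got < BLOCK_SIZE) {
                ssize_t n = recv(sock, block + got, BLOCK_SIZE - got, 0);
                if (n <= 0) { close(sock); return; }
                got += static_cast<std::size_t>(n);
            }
            // "Stored to disk" -- i.e. discarded, as in the prototype.
        } else if (op == 'R') {                     // remote host retrieves a block
            for (std::size_t i = 0; i < BLOCK_SIZE; ++i) block[i] = 'x';  // generated data
            std::size_t sent = 0;
            while (sent < BLOCK_SIZE) {
                ssize_t n = send(sock, block + sent, BLOCK_SIZE - sent, 0);
                if (n <= 0) { close(sock); return; }
                sent += static_cast<std::size_t>(n);
            }
        }
    }
    close(sock);                                    // remote side closed the connection
}
```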

UpdateConnection

Each UpdateConnection is associated with one particular partnership. It picks a random number of blocks, connects to the partner, and sends the blocks to the remote host, following the simple protocol that lets the PeerConnection respond correctly as described above. Once it has completed, it closes the socket and terminates the thread.

RestoreConnection

Each RestoreConnection is, similar to UpdateConnection, associated with one particular partnership. It randomly selects a number of blocks to retrieve from the partner, connects to the host, and asks for blocks one at a time. Once each has been received, it closes the socket and terminates the thread.

For simplicity, it is assumed the remote peer will generate some kind of data to send back, whereas an actual implementation would have to account for the possibility that the host is not there or can't produce the data.
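
For comparison, a client-side sketch in the style of UpdateConnection, using the same assumed 'S' opcode as the server sketch above; it pushes a given number of 4KB blocks over an already-connected socket and then closes it, since the connecting host is responsible for termination:

```cpp
#include <cstddef>
#include <sys/socket.h>
#include <unistd.h>

constexpr std::size_t BLOCK_SIZE = 4096;

// Send `blocks` dummy 4 KB blocks to one partner, then close the connection.
void update_connection(int sock, int blocks) {
    char block[BLOCK_SIZE];
    for (std::size_t i = 0; i < BLOCK_SIZE; ++i) block[i] = 'u';   // dummy payload

    for (int b = 0; b < blocks; ++b) {
        char op = 'S';                                 // "store this block"
        if (send(sock, &op, 1, 0) != 1) break;
        std::size_t sent = 0;
        while (sent < BLOCK_SIZE) {
            ssize_t n = send(sock, block + sent, BLOCK_SIZE - sent, 0);
            if (n <= 0) { close(sock); return; }
            sent += static_cast<std::size_t>(n);
        }
    }
    close(sock);
}
```

A RestoreConnection would look much the same, except it would send the 'R' opcode and read the returned block instead.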

Monday, March 06, 2006

Netnice Reloaded

Netnice provides QoS by setting up what are called VIFs in the /proc/networks directory. Each VIF can be given a weight, priority, and/or bandwidth limitation in order to achieve a network QoS model where the collection of VIFs share the physical network connection.

For the purposes of this experiment, it seems as though we would need a VIF for each connection that a remote peer initiates with a node (at the least), since it's the node's responsibility to shape its peers' access to itself. However, it would be best to shape the connections the software makes as well. Therefore, the software will need to dynamically create and destroy VIFs for each connection it accepts, creates, and terminates. It will further need to be able to set the QoS parameters for each based on established partnership agreements.

I need to find some code samples that do this.... www.netnice.org has none.

Thursday, March 02, 2006

Connection detail

There are two "events" that need to be dealt with in the software: the point in time when an update needs to occur where new data is sent to the partners and the event when a partner connects to the node in order to store data. (We don't need to worry about a host connecting in order to negogiate a parntership just yet).

On an Update

Here, there are several things that need to happen. First, a comparison between an old snapshot and the new state of the data needs to be made in order to identify those blocks that need to be updated on other partners. Then, one at a time, a connection needs to be formed with the necessary partners and the data sent while bandwidth is available to support the transfer. In each connection, we need to identify this host to the remote partner, in turn specify the blocks that are going to be sent (given an assumed block size), and send them, allowing the remote partner to determine the transfer rate (using Netnice).

We model this by assuming a large file of data segmented into blocks, a subset of which are randomly selected and assumed to have changed, and they conveniently reside on a single remote partner. Thus, the software needs to form a single connection with this one partner, and it can allow its IP address to identify itself. It then sends a block ID number of some kind followed by the block itself.

Accepting a connection

When accepting a connection, the software first needs an identification from the remote, connecting partner so as to determine which of its partnerships this connection is servicing. Second, it needs to determine whether this particular connection is requesting data from it or requesting to store an updated block (again forgoing the possibility that this might be an attempt to negotiate a new partnership). Then, it can expect a block ID number to follow. If it's a request for a restore, the host should send some kind of acknowledgement followed by the block. If not, then the remote partner should immediately begin sending the block to store.

Aside from the file system interface, this needs a full implementation.

Notes

  • It should be the connecting host's responsibility to terminate all connections. The accepting (serving) host should be prepared to store or restore successive blocks until the connection is terminated.

Sunday, February 26, 2006

Rough Software Spec

I'm hoping Landon Cox will send me the Samsara code soon, but, as was discussed, it's not wise to wait on him. Thus, this is a bare minimum description of what the software prototype needs to do in the event we don't get Samsara as a base from which to work.

File System.

The full scale software system would need some kind of configuration interface where the user could specify the set of directories or disk mounts that the software should back up. From this the software would essentially create a single huge block of addressable data, which would then be split into chunks for distribution. It would then need to be able to track changes made to the given set of directories and map those changes into the single block.

For this prototype, it would suffice to use a single huge file to model the addressable data space the full scale system would create. Some means of modification to this file would then need to be devised. The prototype would then need to notice the change and accurately determine the section of the file that has been modified. We discussed the use of delta encoding for this purpose (we have to get more information about its implementation if Samsara never comes).

At a bare minimum, the prototype could simply maintain several small blocks, a subset of which are assumed to have changed with each backup "epoch."
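
As a rough illustration of that bare-minimum approach, the following sketch splits the single file into fixed-size blocks and flags the ones whose hashes changed since the last epoch; std::hash stands in for whatever change-detection scheme (delta encoding or otherwise) is eventually chosen:

```cpp
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

constexpr std::size_t BLOCK_SIZE = 4096;

// Hash every fixed-size block of the file; one snapshot per backup epoch.
std::vector<std::size_t> snapshot_hashes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<std::size_t> hashes;
    std::string block(BLOCK_SIZE, '\0');
    while (in.read(&block[0], static_cast<std::streamsize>(BLOCK_SIZE)) || in.gcount() > 0) {
        block.resize(static_cast<std::size_t>(in.gcount()));
        hashes.push_back(std::hash<std::string>{}(block));
        block.resize(BLOCK_SIZE, '\0');
    }
    return hashes;
}

// Indices of blocks whose hash differs from (or did not exist in) the old snapshot.
std::vector<std::size_t> changed_blocks(const std::vector<std::size_t>& old_snap,
                                        const std::vector<std::size_t>& new_snap) {
    std::vector<std::size_t> changed;
    for (std::size_t i = 0; i < new_snap.size(); ++i)
        if (i >= old_snap.size() || new_snap[i] != old_snap[i])
            changed.push_back(i);
    return changed;
}
```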

Questions

Given a set of directories, the prototype flattens them and maps an address space over the result. What then happens as files are added to the directories? Would the flattened block then grow from the middle, shifting all subsequent data to different addresses?

Partnership management.

While Samsara/Pastiche have already tackled the problem of finding partners based on comparable software installations, our system would have to add a service level agreement advertisement and negotiation scheme to this for a full-blown implementation of the Hard Data Proposal. This would require the user to give a configurable estimate of how much data changes, how much that measure varies, how quickly it should be backed up, and how "important" it is to get good service, which would direct the system to give up proportionately more resources compared to its needs, making a more desirable offer and attracting better partners.

At a minimum, the prototype could assume a set of partnerships (as little as one), each with an assumed service level agreement that may or may not be an equal partnership.

Per Partnership needs

For each partnership, the prototype would need to store the following (a minimal data-structure sketch follows the list):

  • From The Partner
    • The service level agreement for the data being stored, which comprises the size of data to store and the minimum bandwidth needed
    • An identifiable tag for each block being stored for the partner (perhaps the starting address and size)
    • The location of all blocks for which the local node is responsible
  • To The Partner
    • The service level agreement for the data being held by the partner
    • An identifiable tag for each block being held by the partner
    • The segments of the address space that the partner is servicing, i.e. storing the equivalent blocks
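
A minimal layout of the per-partnership state listed above; every type and field name here is illustrative, not an existing interface:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct ServiceLevelAgreement {
    std::uint64_t storage_bytes;   // size of data to store
    double        min_bps;         // minimum bandwidth needed
};

struct StoredBlockTag {
    std::uint64_t start_address;   // starting address within the owner's data space
    std::uint64_t size;
};

struct Partnership {
    std::string partner_id;        // e.g. hostname or IP address

    // From the partner: what the local node stores on its behalf.
    ServiceLevelAgreement       inbound_sla;
    std::vector<StoredBlockTag> blocks_held_for_partner;
    std::vector<std::string>    local_block_locations;   // where those blocks live locally

    // To the partner: what the partner stores for the local node.
    ServiceLevelAgreement       outbound_sla;
    std::vector<StoredBlockTag> blocks_held_by_partner;
    std::vector<std::pair<std::uint64_t, std::uint64_t>> address_ranges_at_partner;
};
```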

On an Update or Restore

When a connection is created, the connecting host either wants to update the data stored or retrieve it. The initial communication of the connection should convey this, along with a requested amount of bandwidth for the update or restore and an identification of the blocks to be accessed. The serving host can then grant the connecting host a bandwidth allotment between the minimum agreed-upon bandwidth and the requested amount, which could indicate the host wants all available bandwidth.
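
A small helper could capture that grant rule; treating a requested rate of zero as "give me everything available" is an assumed convention, not part of the design above:

```cpp
#include <algorithm>

// Grant a rate between the SLA minimum and the request, capped by whatever the
// serving host currently has free.
double grant_bandwidth(double sla_min_bps, double requested_bps, double available_bps) {
    double want = (requested_bps == 0.0) ? available_bps : requested_bps;
    return std::max(sla_min_bps, std::min(want, available_bps));
}
```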

Responsibilities

The serving host must be able to control this as the update or restore progresses while other connections from other partners are created and severed. Furthermore, it is the responsibility of the serving host to monitor and control this bandwidth because a connecting host greedily wants as much as possible, while the serving host only wants to give up the smallest amount of resources needed to ensure the partnership continues. It is then the connecting host's responsibility to ensure its own service level agreement is not being starved.

Questions

The question that arises from this is the serving host's ability to manage its network connection. Should it only make agreements such that the sum of the minimum bandwidth requirements is no more than its network capacity? Or can it overshoot its capacity, relying on the fact that not all partners will be using the connection at once? If so, can it then deny or delay a connection based on preoccupied resources? Is this then grounds for termination of an SLA? If not, how frequently can it occur before an SLA should be terminated? (While this would be interesting to research, it would require a large number of hosts and a significantly long experiment. So what assumptions can we make?)

Update Frequency

Based on the user settings, the prototype would have to ensure the given data, either the single file or set of directories, is up to date within a given timeframe. The simplest way to do this would be to have the prototype wake up at regular intervals in step with the user configuration. Upon such an event, the prototype would need to examine the entire set of data in some manner and compare this examination with the one from the last wakeup event. This comparison should yield a set of address ranges that have changed since the last update, which are then compared with each partnership to obtain the set of partners that are storing the corresponding old data. The prototype then connects to each and uploads the newer data while monitoring the upload bandwidth of each connection to ensure it meets its SLA. Once complete, it retires until the next update.
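
One way the wake-up step could map changed blocks to the partners holding the corresponding old data, assuming an in-memory table from partner to block indices (all names here are illustrative):

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using BlockIndex = std::size_t;

// For each partner, collect the changed blocks it is responsible for storing.
std::map<std::string, std::vector<BlockIndex>> blocks_to_send(
        const std::vector<BlockIndex>& changed,
        const std::map<std::string, std::set<BlockIndex>>& partner_blocks) {
    std::map<std::string, std::vector<BlockIndex>> plan;
    for (BlockIndex b : changed) {
        for (const auto& entry : partner_blocks) {
            if (entry.second.count(b))
                plan[entry.first].push_back(b);   // this partner stores block b
        }
    }
    return plan;
}
```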

Considerations

To do this, the prototype would need to be able to map the address ranges back to the directory structure, which could potentially map over several files and directories.

The prototype supposedly would determine whether the outgoing network capacity was sufficient to service the user's demand at the time the user configured the software. But this relied on an average update size and a measure of deviation from it. Thus, the prototype could potentially have more data to update than expected, or a set of circumstances could arise that otherwise impedes the update capability. The prototype must therefore also manage the outgoing bandwidth so that the update occurs as quickly as possible but the local network capacity does not become the bottleneck, restricting the bandwidth of a connection in a way the prototype might incorrectly attribute to the serving host violating the SLA. One way of overcoming this obstacle is to make a single connection at a time, dedicating the entirety of the upload pipe to it and then allowing the serving host to throttle the bandwidth down. Subsequent connections can then be made, one after another, with a similar protocol until there is not enough bandwidth left to properly service another connection.

Lastly, it's not difficult to assume that updates to the data do not occur all at once, but rather are spread over the update time frame interval. Therefore, the size of the needed update would increase with the time interval. Network utilization would be negatively affected, though, as larger bursts of transfers would be required at longer intervals while a more sustained average usage would be needed for shorter, almost continuous, intervals. This, however, assumes that the set of data to be backed up, if it were a set of directories, does not include those where running processes would be creating temporary files. (We could potentially examine this trade-off)

Issues/Questions

This raises the issue of what happens if more data needs to be backed up than the software has time to transfer. Would the software potentially miss an update interval? What then happens if an update occurs while it's trying to transfer the previous snapshot? Can we include the update? Would it simply be easier to include it in the next snapshot? Would that violate what the user needs?

Thursday, February 23, 2006

The Plan

Steps to completing this project:
  1. Practice using Netnice
  2. Hope I get Samsara soon (I have heard from Landon Cox at Duke)
  3. Examine how Samsara/Pastiche find and create storage partnerships
  4. Implement bandwidth reservations/restrictions for all partnerships
  5. Implement SLA negotiation between two peers
  6. Ensure the SLA cannot be violated
  7. Optionally augment the Pastiche routing protocol to include finding peers with desired SLA characteristics
  8. Perform some kind of experimentation on the system to determine.... something
  9. Write a paper describing the work and the experiment

Wednesday, February 22, 2006

Netnice

After discussing the basics of Netnice with someone more knowledgeable about it than I am, I've found a potential snag in its usefulness. Apparently, Netnice is capable of limiting bandwidth on a process-by-process basis, whereas the demands of this application seem to need control on a connection-by-connection basis, and all of the connections may be created within a single process.

However, this may not be an issue. If the Samsara code I'm expected to get soon forks a separate process for each connection, Netnice should work perfectly. Or, if it's multithreaded in a similar way to the fork() call, which I've seen Linux do, Netnice will probably be able to control its bandwidth with the granularity needed.

I'm expecting the daemon to be a self-contained, single process, though, in which case there are two possible solutions I imagine at the moment. At any given time, only a certain number of connections will be established. Each will have a set bandwidth allocation, and the aggregate total of these could be the value Netnice uses to shape the process' bandwidth. The alternative would be to implement a custom shaping scheme involving the packet size and a timer.

Once I get the code, I'll reexamine my options...

Administration

A portion of the bandwidth used by any particular node in a P2P backup system has to be dedicated to administrative purposes. After a service level agreement is negotiated, a node must constantly ensure that its partner is available and that its data is available on that partner's shared storage. While the former is relatively negligible, ensuring the data is available could be rather expensive. The Samsara and/or Pastiche papers suggested using a computed hash of the data to ensure the data is stored on the host, but Lillibridge points out that this will not ensure the partnering node will provide the data when it's needed. The benefit a partner gains from not supplying the data, namely the unused bandwidth, seems rather small given the steep cost of having to hold the data anyway, but it still creates a potential impact on the reliability of the system.

Thus, the added administrative cost is the size of the data block that needs to be transferred, the bandwidth (time) used in performing the transfer, and the frequency with which this needs to occur for all partners of a particular node.

Presumably, this cost is fixed, and hence assumed, for any service level agreement between two partners. It should, however, be chosen so as to minimize the cost of administering the partnership.

Tuesday, February 21, 2006

Negotiation

Each node in the P2P backup system must be able to negotiate its own service level agreements based on the needs specified by the user. Assuming that a node engages in reciprocal storage relationships with N other nodes in the system, it can agree upon an individual service level agreement at the same time with each of the N nodes based upon the portion of the data it wishes to store on any given one of them. The bandwidth needed to perform an update on the data can be divided among the N nodes in proportion to the amount of data that will be stored on the given node, but it may not necessarily equal the bandwidth the node agrees to supply to its partners.
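
A sketch of that proportional division (the names are illustrative):

```cpp
#include <vector>

// Split the update bandwidth among N partners in proportion to the amount of
// data each one stores for this node.
std::vector<double> proportional_split(double update_bps,
                                       const std::vector<double>& bytes_at_partner) {
    double total = 0.0;
    for (double b : bytes_at_partner) total += b;
    std::vector<double> shares;
    for (double b : bytes_at_partner)
        shares.push_back(total > 0.0 ? update_bps * (b / total) : 0.0);
    return shares;
}
```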

The duty of enforcing a service level agreement falls upon the supplier. It is in the node's best interest to ensure that none of its partners are consuming more bandwidth or storage than was agreed upon. While enforcing storage space limits has been dealt with in Samsara, a node must be able to restrict a connection from a partner to the agreed-upon bandwidth limit if it needs or chooses to allocate the bandwidth elsewhere. Netnice should work well for this, provided it allows the bandwidth allotted to a particular connection to be changed arbitrarily.

A malicious node may throttle a connection below the agreed-upon bandwidth, but a partner in such a situation would be able to terminate the agreement and find another, more benevolent partner. This would then starve malicious nodes, forcing them at the least to meet the minimum of an agreed-upon SLA.

Monday, February 20, 2006

P2P Backup Systems

The System. Peer-to-peer backup systems offer a relatively cheap alternative to Internet backup sites while potentially providing the added security of spreading backed-up data across multiple, geographically distant hosts. Reliability in such systems depends on ensuring resources are available to perform the backup, store the data on a remote host, and retrieve the data when needed. This requires that a large portion of the hosts meet the needs of other peers to store and supply data when needed, and that the system ensure no single host is unexpectedly or greedily consuming resources (bandwidth or storage) such that it deprives other hosts from properly using the system.

Commercialization of these systems then requires each host to make and conform to service level agreements, which simultaneously predict the host's backup behavior and allow the system to ensure resources are available to service this typical behavior.

Service Level Agreements. A typical service level agreement would be similar to a user stating, "I have this directory that needs to be backed up every t [minutes, hours, days, weeks, years], and, on average, X bytes of data change in that time frame." Two criteria come from this: the initial size of the directory, which is the size of the storage that will be needed across the system; and the bandwidth needed to meet the update demands, which requires some analysis.

Bandwidth. If at every interval of time, X and only X bytes of data needed to be backed up, the bare minimum bandwidth needed for the system to back up the data is X/t. While this may provide a good estimate, it would be wise to assume the user of the backing-up node may have spikes and lulls in the amount of data that needs to be backed up from interval to interval. Thus, a measure of the standard deviation of this average amount of data to be backed up would allow an upper bound to be determined on the amount of bandwidth needed to service the majority of the user's needs. Furthermore, it would allow the system to reject, or charge extra for, needs that exceed the agreed-upon SLA.
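
A rough sizing helper along those lines; the factor k (how many standard deviations to cover) is a tunable assumption, not something fixed by the argument above:

```cpp
// Bandwidth sizing from the SLA parameters: the bare minimum is X/t, and an
// upper bound adds k standard deviations of the per-interval change. The 8.0
// converts bytes per second to bits per second.
struct BandwidthEstimate {
    double minimum_bps;
    double upper_bound_bps;
};

BandwidthEstimate size_bandwidth(double x_bytes_per_interval,
                                 double interval_seconds,
                                 double stddev_bytes,
                                 double k = 2.0) {
    BandwidthEstimate e;
    e.minimum_bps     = 8.0 * x_bytes_per_interval / interval_seconds;
    e.upper_bound_bps = 8.0 * (x_bytes_per_interval + k * stddev_bytes) / interval_seconds;
    return e;
}
```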

A corollary to this argument is that, if the bandwidth is available, the node should consume all unused bandwidth to achieve a consistent, backed-up state as soon as possible. Under these conditions, each node must still be guaranteed the minimum bandwidth it needs to meet its needs, which requires the system to be able to back off a node that is consuming more bandwidth than it needs in order to ensure that every other node receives its share of the bandwidth.