\documentclass[a4paper]{article}
\begin{document}


\title{The rsync algorithm}

\author{Andrew Tridgell \quad\quad Paul Mackerras\\
Department of Computer Science \\
Australian National University \\
Canberra, ACT 0200, Australia}

\maketitle

\begin{abstract}
  This report presents an algorithm for updating a file on one machine
  to be identical to a file on another machine.  We assume that the
  two machines are connected by a low-bandwidth high-latency
  bi-directional communications link.  The algorithm identifies parts
  of the source file which are identical to some part of the
  destination file, and only sends those parts which cannot be matched
  in this way.  Effectively, the algorithm computes a set of
  differences without having both files on the same machine.  The
  algorithm works best when the files are similar, but will also
  function correctly and reasonably efficiently when the files are
  quite different.
\end{abstract}
\section{The problem}

Imagine you have two files, $A$ and $B$, and you wish to update $B$ to be
the same as $A$. The obvious method is to copy $A$ onto $B$.

Now imagine that the two files are on machines connected by a slow
communications link, for example a dialup IP link.  If $A$ is large,
copying $A$ onto $B$ will be slow.  To make it faster you could
compress $A$ before sending it, but that will usually only gain a
factor of 2 to 4.

Now assume that $A$ and $B$ are quite similar, perhaps both derived
from the same original file. To really speed things up you would need
to take advantage of this similarity. A common method is to send just
the differences between $A$ and $B$ down the link and then use this
list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of
differences between two files rely on being able to read both files.
Thus they require that both files are available beforehand at one end
of the link.  If they are not both available on the same machine,
these algorithms cannot be used (once you had copied the file over,
you wouldn't need the differences).  This is the problem that rsync
addresses.

The rsync algorithm efficiently computes which parts of a source file
match some part of an existing destination file.  These parts need not
be sent across the link; all that is needed is a reference to the part
of the destination file.  Only parts of the source file which are not
matched in this way need to be sent verbatim.  The receiver can then
construct a copy of the source file using the references to parts of
the existing destination file and the verbatim material.

Trivially, the data sent to the receiver can be compressed using any
of a range of common compression algorithms, for further speed
improvements.
\section{The rsync algorithm}

Suppose we have two general purpose computers $\alpha$ and $\beta$.
Computer $\alpha$ has access to a file $A$ and $\beta$ has access to
file $B$, where $A$ and $B$ are ``similar''.  There is a slow
communications link between $\alpha$ and $\beta$.

The rsync algorithm consists of the following steps:
\begin{enumerate}
\item $\beta$ splits the file $B$ into a series of non-overlapping
  fixed-sized blocks of size $S$ bytes.\footnote{We have found that
  values of $S$ between 500 and 1000 are quite good for most
  purposes.}  The last block may be shorter than $S$ bytes.

\item For each of these blocks $\beta$ calculates two checksums:
  a weak ``rolling'' 32-bit checksum (described below) and a strong
  128-bit MD4 checksum (see the sketch after this list).

\item $\beta$ sends these checksums to $\alpha$.

\item $\alpha$ searches through $A$ to find all blocks of length $S$
  bytes (at any offset, not just multiples of $S$) that have the same
  weak and strong checksum as one of the blocks of $B$. This can be
  done in a single pass very quickly using a special property of the
  rolling checksum described below.

\item $\alpha$ sends $\beta$ a sequence of instructions for
  constructing a copy of $A$.  Each instruction is either a reference
  to a block of $B$, or literal data.  Literal data is sent only for
  those sections of $A$ which did not match any of the blocks of $B$.
\end{enumerate}
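
To make steps 1 and 2 concrete, here is a minimal C sketch of the
generator that runs on $\beta$.  The {\tt sum\_t} layout and the
{\tt md4()} and {\tt weak\_checksum()} helpers are illustrative names
of our own, not rsync's actual format; {\tt weak\_checksum()} is the
rolling checksum defined in the next section.

\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t weak;        /* 32-bit rolling checksum */
    uint8_t  strong[16];  /* 128-bit MD4 checksum */
    uint32_t index;       /* block's position in B, kept so the
                             list can later be sorted for searching */
} sum_t;

/* Assumed helpers: an MD4 routine, and the weak checksum
 * defined in the next section. */
extern void md4(const uint8_t *buf, size_t len, uint8_t out[16]);
extern uint32_t weak_checksum(const uint8_t *buf, size_t len);

/* Steps 1 and 2: split B into S-byte blocks (the last one may be
 * short) and checksum each block.  Returns the number of blocks. */
size_t generate_sums(FILE *f, size_t S, sum_t *sums, size_t max)
{
    uint8_t buf[1024];    /* this sketch assumes S <= 1024 */
    size_t n, count = 0;

    while (count < max && (n = fread(buf, 1, S, f)) > 0) {
        sums[count].weak  = weak_checksum(buf, n);
        md4(buf, n, sums[count].strong);
        sums[count].index = (uint32_t)count;
        count++;
    }
    return count;
}
\end{verbatim}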

The end result is that $\beta$ gets a copy of $A$, but only the pieces
of $A$ that are not found in $B$ (plus a small amount of data for
checksums and block indexes) are sent over the link. The algorithm
also requires only one round trip, which minimises the impact of the
link latency.

The most important details of the algorithm are the rolling checksum
and the associated multi-alternate search mechanism which allows the
all-offsets checksum search to proceed very quickly. These will be
discussed in greater detail below.
\section{Rolling checksum}

The weak rolling checksum used in the rsync algorithm needs to have
the property that it is very cheap to calculate the checksum of a
buffer $X_2 \ldots X_{n+1}$ given the checksum of buffer
$X_1 \ldots X_n$ and the values of the bytes $X_1$ and $X_{n+1}$.

The weak checksum algorithm we used in our implementation was inspired
by Mark Adler's adler-32 checksum.  Our checksum is defined by
$$ a(k,l) = \left( \sum_{i=k}^{l} X_i \right) \bmod M $$
$$ b(k,l) = \left( \sum_{i=k}^{l} (l-i+1) X_i \right) \bmod M $$
$$ s(k,l) = a(k,l) + 2^{16} b(k,l) $$

where $s(k,l)$ is the rolling checksum of the bytes $X_k \ldots X_l$.
For simplicity and speed, we use $M = 2^{16}$.

The important property of this checksum is that successive values can
be computed very efficiently using the recurrence relations

$$ a(k+1,l+1) = (a(k,l) - X_k + X_{l+1}) \bmod M $$
$$ b(k+1,l+1) = (b(k,l) - (l-k+1) X_k + a(k+1,l+1)) \bmod M $$

Thus the checksum can be calculated for blocks of length $S$ at all
possible offsets within a file in a ``rolling'' fashion, with very
little computation at each point.
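
As an illustration, the following C sketch implements the checksum and
its rolling update for $M = 2^{16}$; the function names are our own.
Masking with {\tt 0xffff} implements the mod, and unsigned wrap-around
makes the subtractions in the recurrences safe before masking.

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t a, b; } rolsum_t;

/* Compute a(k,l) and b(k,l) over an S-byte window from scratch. */
static rolsum_t weak_init(const uint8_t *X, size_t S)
{
    rolsum_t r = { 0, 0 };
    for (size_t j = 0; j < S; j++) {
        r.a = (r.a + X[j]) & 0xffff;
        r.b = (r.b + (uint32_t)(S - j) * X[j]) & 0xffff;
    }
    return r;
}

/* The recurrences: slide the window one byte to the right,
 * dropping X_k and taking in X_{l+1}.  Here l-k+1 = S. */
static void weak_roll(rolsum_t *r, uint8_t Xk, uint8_t Xl1, size_t S)
{
    r->a = (r->a - Xk + Xl1) & 0xffff;
    r->b = (r->b - (uint32_t)(S * Xk) + r->a) & 0xffff;
}

/* s(k,l) = a(k,l) + 2^16 b(k,l), packed into one 32-bit word. */
static uint32_t weak_value(const rolsum_t *r)
{
    return r->a + (r->b << 16);
}
\end{verbatim}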

Despite its simplicity, this checksum was found to be quite adequate as
a first-level check for a match of two file blocks.  We have found in
practice that the probability of this checksum matching when the
blocks are not equal is quite low.  This is important because the much
more expensive strong checksum must be calculated for each block where
the weak checksum matches.
\section{Checksum searching}

Once $\alpha$ has received the list of checksums of the blocks of $B$,
it must search $A$ for any blocks at any offset that match the
checksum of some block of $B$.  The basic strategy is to compute the
32-bit rolling checksum for a block of length $S$ starting at each
byte of $A$ in turn, and for each checksum, search the list for a
match.  To do this our implementation uses a simple three-level
searching scheme.

The first level uses a 16-bit hash of the 32-bit rolling checksum and
a $2^{16}$ entry hash table.  The list of checksum values (i.e., the
checksums from the blocks of $B$) is sorted according to the 16-bit
hash of the 32-bit rolling checksum.  Each entry in the hash table
points to the first element of the list for that hash value, or
contains a null value if no element of the list has that hash value.
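
Continuing the earlier sketch, the first level might be built as
follows.  The 16-bit {\tt tag()} hash is an assumption (any mixing of
the two halves of the weak checksum would do), and $-1$ serves as the
null table entry.

\begin{verbatim}
#include <stdlib.h>

#define TABLE_SIZE (1 << 16)
#define NULL_ENTRY (-1)

/* An assumed 16-bit hash of the 32-bit rolling checksum: fold the
 * two halves together.  rsync's actual tag function may differ. */
static uint16_t tag(uint32_t weak)
{
    return (uint16_t)((weak & 0xffff) + (weak >> 16));
}

static int cmp_tag(const void *pa, const void *pb)
{
    int ta = tag(((const sum_t *)pa)->weak);
    int tb = tag(((const sum_t *)pb)->weak);
    return ta - tb;
}

/* Sort the n checksum entries by tag, then point each table slot
 * at the first entry with that tag (or NULL_ENTRY if none). */
static void build_table(sum_t *sums, size_t n,
                        int32_t table[TABLE_SIZE])
{
    qsort(sums, n, sizeof(sum_t), cmp_tag);
    for (size_t t = 0; t < TABLE_SIZE; t++)
        table[t] = NULL_ENTRY;
    for (size_t i = n; i-- > 0; )     /* walk backwards, so the    */
        table[tag(sums[i].weak)]      /* first entry of a run wins */
            = (int32_t)i;
}
\end{verbatim}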

At each offset in the file the 32-bit rolling checksum and its 16-bit
hash are calculated.  If the hash table entry for that hash value is
not a null value, the second-level check is invoked.

The second-level check involves scanning the sorted checksum list
starting with the entry pointed to by the hash table entry, looking
for an entry whose 32-bit rolling checksum matches the current value.
The scan terminates when it reaches an entry whose 16-bit hash
differs.  If this search finds a match, the third-level check is
invoked.

The third-level check involves calculating the strong checksum for the
current offset in the file and comparing it with the strong checksum
value in the current list entry.  If the two strong checksums match,
we assume that we have found a block of $A$ which matches a block of
$B$.  In fact the blocks could be different, but the probability of
this is microscopic, and in practice this is a reasonable assumption.
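
Putting the three levels together for a single offset, a sketch of the
lookup (reusing the assumed names above) might read:

\begin{verbatim}
#include <string.h>

/* Check one S-byte window of A against the sorted list.  Returns
 * the matching block's index in B, or -1.  md4() is only invoked
 * on a level-two (weak checksum) hit. */
static int find_match(uint32_t weak, const uint8_t *win, size_t len,
                      const sum_t *sums, size_t n,
                      const int32_t table[TABLE_SIZE])
{
    uint16_t t = tag(weak);
    int32_t i = table[t];

    if (i == NULL_ENTRY)                /* level one: no tag hit */
        return -1;
    for (; (size_t)i < n && tag(sums[i].weak) == t; i++) {
        if (sums[i].weak != weak)       /* level two: weak mismatch */
            continue;
        uint8_t strong[16];             /* level three: strong check */
        md4(win, len, strong);
        if (memcmp(strong, sums[i].strong, 16) == 0)
            return (int)sums[i].index;
    }
    return -1;
}
\end{verbatim}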

When a match is found, $\alpha$ sends $\beta$ the data in $A$ between
the current file offset and the end of the previous match, followed by
the index of the block in $B$ that matched.  This data is sent as soon
as a match is found, which allows us to overlap the communication with
further computation.

If no match is found at a given offset in the file, the rolling
checksum is updated to the next offset and the search proceeds.  If a
match is found, the search is restarted at the end of the matched
block.  This strategy saves a considerable amount of computation for
the common case where the two files are nearly identical.  In
addition, it would be a simple matter to encode the block indexes as
runs, for the common case where a portion of $A$ matches a series of
blocks of $B$ in order.
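
The outer scan, including the restart-after-match behaviour just
described, might look like the following sketch.
{\tt emit\_literal()} and {\tt emit\_block()} are stand-ins for the
real instruction stream, and for brevity the sketch ignores the
possibility of a short final block in $B$.

\begin{verbatim}
extern void emit_literal(const uint8_t *p, size_t n);  /* assumed */
extern void emit_block(int index);                     /* assumed */

static void scan_file(const uint8_t *A, size_t len, size_t S,
                      const sum_t *sums, size_t n,
                      const int32_t table[TABLE_SIZE])
{
    size_t last = 0, off = 0;
    rolsum_t r;

    if (len < S) { emit_literal(A, len); return; }
    r = weak_init(A, S);
    while (off + S <= len) {
        int idx = find_match(weak_value(&r), A + off, S,
                             sums, n, table);
        if (idx >= 0) {
            /* Send the unmatched bytes, then a reference to the
             * matched block, and restart after the match. */
            emit_literal(A + last, off - last);
            emit_block(idx);
            off += S;
            last = off;
            if (off + S <= len)
                r = weak_init(A + off, S);
        } else {
            if (off + S < len)          /* slide in one more byte */
                weak_roll(&r, A[off], A[off + S], S);
            off++;
        }
    }
    emit_literal(A + last, len - last); /* trailing literal data */
}
\end{verbatim}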

\section{Pipelining}

The above sections describe the process for constructing a copy of one
file on a remote system.  If we have several files to copy, we can
gain a considerable latency advantage by pipelining the process.

This involves $\beta$ initiating two independent processes.  One of the
processes generates and sends the checksums to $\alpha$ while the
other receives the difference information from $\alpha$ and
reconstructs the files.

If the communications link is buffered then these two processes can
proceed independently and the link should be kept fully utilised in
both directions for most of the time.
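
As a minimal sketch of this arrangement, assuming POSIX {\tt fork()}
and two hypothetical helpers for the checksum and reconstruction
loops, the receiving side might be structured like this:

\begin{verbatim}
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern void send_all_checksums(int out_fd);   /* assumed helper */
extern void receive_and_rebuild(int in_fd);   /* assumed helper */

/* beta runs two independent processes: the child streams checksums
 * for every file to alpha, while the parent concurrently reads
 * alpha's instruction streams and reconstructs the files. */
void run_receiver(int to_alpha, int from_alpha)
{
    pid_t pid = fork();

    if (pid == 0) {                 /* child: the generator */
        send_all_checksums(to_alpha);
        _exit(0);
    }
    receive_and_rebuild(from_alpha);    /* parent: the rebuilder */
    waitpid(pid, NULL, 0);
}
\end{verbatim}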

\section{Results}

To test the algorithm, tar files were created of the Linux kernel
sources for two versions of the kernel. The two kernel versions were
1.99.10 and 2.0.0. These tar files are approximately 24 MB in size and
are separated by 5 released patch levels.

Out of the 2441 files in the 1.99.10 release, 291 files had changed in
the 2.0.0 release, 19 files had been removed and 25 files had been
added.

A ``diff'' of the two tar files using the standard GNU diff utility
produced over 32 thousand lines of output totalling 2.1 MB.

The following table shows the results for rsync between the two files
with a varying block size.\footnote{All the tests in this section were
  carried out using rsync version 0.5.}
\vspace*{5mm}
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
{\bf block} & {\bf matches} & {\bf tag}  & {\bf false}  & {\bf data} & {\bf written}  & {\bf read} \\
{\bf size}  &               & {\bf hits} & {\bf alarms} &            &                &            \\ \hline \hline

300         & 64247         & 3817434    & 948          & 5312200    & 5629158        & 1632284   \\ \hline
500         & 46989         & 620013     & 64           & 1091900    & 1283906        & 979384    \\ \hline
700         & 33255         & 571970     & 22           & 1307800    & 1444346        & 699564    \\ \hline
900         & 25686         & 525058     & 24           & 1469500    & 1575438        & 544124    \\ \hline
1100        & 20848         & 496844     & 21           & 1654500    & 1740838        & 445204    \\ \hline
\end{tabular}
\vspace*{5mm}
In each case, the CPU time taken was less than the
time it takes to run ``diff'' on the two files.\footnote{The wall
  clock time was approximately 2 minutes per run on a 50 MHz SPARC 10
  running SunOS, using rsh over loopback for communication.  GNU diff
  took about 4 minutes.}

The columns in the table are as follows:

\begin{description}
\item [block size] The size in bytes of the checksummed blocks.
\item [matches] The number of times a block of $B$ was found in $A$.
\item [tag hits] The number of times the 16-bit hash of the rolling
  checksum matched a hash of one of the checksums from $B$.
\item [false alarms] The number of times the 32-bit rolling checksum
  matched but the strong checksum didn't.
\item [data] The amount of file data transferred verbatim, in bytes.
\item [written] The total number of bytes written by $\alpha$,
  including protocol overheads. This is almost all file data.
\item [read] The total number of bytes read by $\alpha$, including
  protocol overheads. This is almost all checksum information.
\end{description}

The results demonstrate that for block sizes above 300 bytes, only a
small fraction (around 5\%) of the file was transferred. The amount
transferred was also considerably less than the size of the diff file
that would have been transferred if the diff/patch method of updating
a remote file was used.

The checksums themselves took up a considerable amount of space,
although much less than the size of the data transferred in each
case. Each pair of checksums consumes 20 bytes: 4 bytes for the
rolling checksum plus 16 bytes for the 128-bit MD4 checksum.

The number of false alarms was less than $1/1000$ of the number of
true matches, indicating that the 32-bit rolling checksum is quite
good at screening out false matches.

The number of tag hits indicates that the second level of the
checksum search algorithm was invoked about once every 50
characters.  This is quite high because the total number of blocks in
the file is a large fraction of the size of the tag hash table. For
smaller files we would expect the tag hit rate to be much closer to
the number of matches.  For extremely large files, we should probably
increase the size of the hash table.

The next table shows similar results for a much smaller set of files.
In this case the files were not packed into a tar file first. Rather,
rsync was invoked with an option to recursively descend the directory
tree. The files used were from two source releases of another software
package called Samba. The total source code size is 1.7 MB and the
diff between the two releases is 4155 lines long totalling 120 kB.

\vspace*{5mm}
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
{\bf block} & {\bf matches} & {\bf tag}  & {\bf false}  & {\bf data} & {\bf written}  & {\bf read} \\
{\bf size}  &               & {\bf hits} & {\bf alarms} &            &                &            \\ \hline \hline

300         & 3727          & 3899       & 0            & 129775     & 153999         & 83948       \\ \hline
500         & 2158          & 2325       & 0            & 171574     & 189330         & 50908       \\ \hline
700         & 1517          & 1649       & 0            & 195024     & 210144         & 36828       \\ \hline
900         & 1156          & 1281       & 0            & 222847     & 236471         & 29048       \\ \hline
1100        & 921           & 1049       & 0            & 250073     & 262725         & 23988       \\ \hline
\end{tabular}
\vspace*{5mm}


\section{Availability}

An implementation of rsync which provides a convenient interface
similar to the common UNIX command rcp has been written and is
available for download from http://rsync.samba.org/.

\end{document}
