\documentclass[a4paper]{article}
\begin{document}


\title{The rsync algorithm}

\author{Andrew Tridgell \quad\quad Paul Mackerras\\
Department of Computer Science \\
Australian National University \\
Canberra, ACT 0200, Australia}

\maketitle

\begin{abstract}
  This report presents an algorithm for updating a file on one machine
  to be identical to a file on another machine.  We assume that the
  two machines are connected by a low-bandwidth high-latency
  bi-directional communications link.  The algorithm identifies parts
  of the source file which are identical to some part of the
  destination file, and only sends those parts which cannot be matched
  in this way.  Effectively, the algorithm computes a set of
  differences without having both files on the same machine.  The
  algorithm works best when the files are similar, but will also
  function correctly and reasonably efficiently when the files are
  quite different.
\end{abstract}

\section{The problem}

Imagine you have two files, $A$ and $B$, and you wish to update $B$ to be
the same as $A$. The obvious method is to copy $A$ onto $B$.

Now imagine that the two files are on machines connected by a slow
communications link, for example a dialup IP link.  If $A$ is large,
copying $A$ onto $B$ will be slow.  To make it faster you could
compress $A$ before sending it, but that will usually only gain a
factor of 2 to 4.

Now assume that $A$ and $B$ are quite similar, perhaps both derived
from the same original file. To really speed things up you would need
to take advantage of this similarity. A common method is to send just
the differences between $A$ and $B$ down the link and then use this
list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of
differences between two files rely on being able to read both files.
Thus they require that both files are available beforehand at one end
of the link.  If they are not both available on the same machine,
these algorithms cannot be used (once you had copied the file over,
you wouldn't need the differences).  This is the problem that rsync
addresses.

The rsync algorithm efficiently computes which parts of a source file
match some part of an existing destination file.  These parts need not
be sent across the link; all that is needed is a reference to the part
of the destination file.  Only parts of the source file which are not
matched in this way need to be sent verbatim.  The receiver can then
construct a copy of the source file using the references to parts of
the existing destination file and the verbatim material.

Trivially, the data sent to the receiver can be compressed using any
of a range of common compression algorithms, for further speed
improvements.

\section{The rsync algorithm}

Suppose we have two general-purpose computers $\alpha$ and $\beta$.
Computer $\alpha$ has access to a file $A$ and $\beta$ has access to
file $B$, where $A$ and $B$ are ``similar''.  There is a slow
communications link between $\alpha$ and $\beta$.

The rsync algorithm consists of the following steps:

\begin{enumerate}
\item $\beta$ splits the file $B$ into a series of non-overlapping
  fixed-sized blocks of size $S$ bytes\footnote{We have found that
  values of $S$ between 500 and 1000 are quite good for most purposes.}.
  The last block may be shorter than $S$ bytes.

\item For each of these blocks $\beta$ calculates two checksums:
  a weak ``rolling'' 32-bit checksum (described below) and a strong
  128-bit MD4 checksum.  (A sketch of this per-block computation is
  given after this list.)

\item $\beta$ sends these checksums to $\alpha$.

\item $\alpha$ searches through $A$ to find all blocks of length $S$
  bytes (at any offset, not just multiples of $S$) that have the same
  weak and strong checksum as one of the blocks of $B$. This can be
  done in a single pass very quickly using a special property of the
  rolling checksum described below.

\item $\alpha$ sends $\beta$ a sequence of instructions for
  constructing a copy of $A$.  Each instruction is either a reference
  to a block of $B$, or literal data.  Literal data is sent only for
  those sections of $A$ which did not match any of the blocks of $B$.
\end{enumerate}
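
As a concrete illustration of steps 1 and 2, the following C sketch
shows how $\beta$ might walk through $B$ and produce one checksum pair
per block.  It is only a sketch of the structure, not the actual rsync
implementation: the \texttt{md4()} prototype stands in for whatever
MD4 routine is available, and the weak checksum is the one defined in
the section on the rolling checksum below.

\begin{verbatim}
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE 700          /* S: block size in bytes */

/* Assumed one-shot MD4 routine (rsync carries its own MD4 code;
   any MD4 library call of this shape would do). */
void md4(const unsigned char *buf, size_t len, unsigned char digest[16]);

/* Weak 32-bit checksum s(k,l) as defined below, with M = 2^16. */
static uint32_t weak_checksum(const unsigned char *buf, size_t len)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a += buf[i];
        b += (uint32_t)(len - i) * buf[i];
    }
    return (a & 0xffff) | ((b & 0xffff) << 16);
}

/* Steps 1 and 2: split B into fixed-sized blocks and compute a weak
   and a strong checksum for each.  Step 3 would transmit the
   resulting (index, weak, strong) triples to alpha. */
static void generate_checksums(FILE *f)
{
    unsigned char block[BLOCK_SIZE];
    unsigned char digest[16];
    size_t n;
    unsigned index = 0;

    while ((n = fread(block, 1, BLOCK_SIZE, f)) > 0) {
        uint32_t weak = weak_checksum(block, n); /* last block may be short */
        md4(block, n, digest);
        printf("block %u: weak checksum %08x\n", index++, (unsigned)weak);
        /* ... send (index, weak, digest) to alpha ... */
    }
}
\end{verbatim}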

The end result is that $\beta$ gets a copy of $A$, but only the pieces
of $A$ that are not found in $B$ (plus a small amount of data for
checksums and block indexes) are sent over the link. The algorithm
also only requires one round trip, which minimises the impact of the
link latency.

The most important details of the algorithm are the rolling checksum
and the associated multi-alternate search mechanism which allows the
all-offsets checksum search to proceed very quickly. These will be
discussed in greater detail below.

\section{Rolling checksum}

The weak rolling checksum used in the rsync algorithm needs to have
the property that it is very cheap to calculate the checksum of a
buffer $X_2 \ldots X_{n+1}$ given the checksum of buffer $X_1 \ldots X_n$ and
the values of the bytes $X_1$ and $X_{n+1}$.

The weak checksum algorithm we used in our implementation was inspired
by Mark Adler's adler-32 checksum.  Our checksum is defined by
$$ a(k,l) = (\sum_{i=k}^l X_i) \bmod M $$
$$ b(k,l) = (\sum_{i=k}^l (l-i+1)X_i) \bmod M $$
$$ s(k,l) = a(k,l) + 2^{16} b(k,l) $$

where $s(k,l)$ is the rolling checksum of the bytes $X_k \ldots X_l$.
For simplicity and speed, we use $M = 2^{16}$.

The important property of this checksum is that successive values can
be computed very efficiently using the recurrence relations

$$ a(k+1,l+1) = (a(k,l) - X_k + X_{l+1}) \bmod M $$
$$ b(k+1,l+1) = (b(k,l) - (l-k+1) X_k + a(k+1,l+1)) \bmod M $$

Thus the checksum can be calculated for blocks of length $S$ at all
possible offsets within a file in a ``rolling'' fashion, with very
little computation at each point.
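
The fragment below is a minimal C sketch of these formulas (the
function and variable names are ours, not rsync's): one routine
computes $a$ and $b$ for a window from scratch, and another applies
the recurrence relations to slide the window forward by one byte.

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* Compute a(k,l) and b(k,l) for a window of `len' bytes.
   With M = 2^16 the reduction is a simple mask. */
static void checksum_init(const unsigned char *buf, size_t len,
                          uint32_t *a, uint32_t *b)
{
    uint32_t sa = 0, sb = 0;
    for (size_t i = 0; i < len; i++) {
        sa += buf[i];
        sb += (uint32_t)(len - i) * buf[i];
    }
    *a = sa & 0xffff;
    *b = sb & 0xffff;
}

/* Roll the window one byte forward: drop `out' (X_k) and take in
   `in' (X_{l+1}), using the two recurrence relations above.
   `len' is the window length, i.e. l - k + 1. */
static void checksum_roll(uint32_t *a, uint32_t *b, size_t len,
                          unsigned char out, unsigned char in)
{
    *a = (*a - out + in) & 0xffff;
    *b = (*b - (uint32_t)len * out + *a) & 0xffff;
}

/* s(k,l): a in the low 16 bits, b in the high 16 bits. */
static uint32_t checksum_value(uint32_t a, uint32_t b)
{
    return a | (b << 16);
}
\end{verbatim}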

Despite its simplicity, this checksum was found to be quite adequate as
a first-level check for a match of two file blocks.  We have found in
practice that the probability of this checksum matching when the
blocks are not equal is quite low.  This is important because the much
more expensive strong checksum must be calculated for each block where
the weak checksum matches.

\section{Checksum searching}

Once $\alpha$ has received the list of checksums of the blocks of $B$,
it must search $A$ for any blocks at any offset that match the
checksum of some block of $B$.  The basic strategy is to compute the
32-bit rolling checksum for a block of length $S$ starting at each
byte of $A$ in turn, and for each checksum, search the list for a
match.  To do this, our implementation uses a simple three-level
searching scheme.

The first level uses a 16-bit hash of the 32-bit rolling checksum and
a $2^{16}$-entry hash table.  The list of checksum values (i.e., the
checksums from the blocks of $B$) is sorted according to the 16-bit
hash of the 32-bit rolling checksum.  Each entry in the hash table
points to the first element of the list for that hash value, or
contains a null value if no element of the list has that hash value.
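
A minimal sketch of this first-level structure is given below, under
the assumption that the checksum pairs received from $\beta$ have been
read into an array.  The particular 16-bit hash shown (folding the two
halves of the 32-bit checksum together) is one plausible choice rather
than a description of rsync's exact code.

\begin{verbatim}
#include <stdint.h>
#include <stdlib.h>

struct sum_entry {
    uint32_t weak;              /* 32-bit rolling checksum of the block */
    unsigned char strong[16];   /* 128-bit MD4 checksum of the block */
    unsigned index;             /* block index within B */
};

/* 16-bit hash ("tag") of the 32-bit rolling checksum. */
static uint16_t tag(uint32_t weak)
{
    return (uint16_t)((weak & 0xffff) + (weak >> 16));
}

static int cmp_by_tag(const void *pa, const void *pb)
{
    uint16_t ta = tag(((const struct sum_entry *)pa)->weak);
    uint16_t tb = tag(((const struct sum_entry *)pb)->weak);
    return (ta > tb) - (ta < tb);
}

#define NULL_TAG (-1)

/* table[h] becomes the index of the first sorted entry whose tag is h,
   or NULL_TAG if no block of B has that tag. */
static void build_tag_table(struct sum_entry *sums, size_t nsums,
                            int32_t table[65536])
{
    qsort(sums, nsums, sizeof(sums[0]), cmp_by_tag);
    for (size_t h = 0; h < 65536; h++)
        table[h] = NULL_TAG;
    for (size_t i = nsums; i-- > 0; )       /* walk backwards so the  */
        table[tag(sums[i].weak)] = (int32_t)i; /* first of a run wins */
}
\end{verbatim}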

At each offset in the file the 32-bit rolling checksum and its 16-bit
hash are calculated.  If the hash table entry for that hash value is
not a null value, the second-level check is invoked.

The second-level check involves scanning the sorted checksum list
starting with the entry pointed to by the hash table entry, looking
for an entry whose 32-bit rolling checksum matches the current value.
The scan terminates when it reaches an entry whose 16-bit hash
differs.  If this search finds a match, the third-level check is
invoked.

The third-level check involves calculating the strong checksum for the
current offset in the file and comparing it with the strong checksum
value in the current list entry.  If the two strong checksums match,
we assume that we have found a block of $A$ which matches a block of
$B$.  In fact the blocks could be different, but the probability of
this is microscopic, and in practice this is a reasonable assumption.

When a match is found, $\alpha$ sends $\beta$ the data in $A$ between
the current file offset and the end of the previous match, followed by
the index of the block in $B$ that matched.  This data is sent as soon
as a match is found, which allows us to overlap the communication with
further computation.

If no match is found at a given offset in the file, the rolling
checksum is updated to the next offset and the search proceeds.  If a
match is found, the search is restarted at the end of the matched
block.  This strategy saves a considerable amount of computation for
the common case where the two files are nearly identical.  In
addition, it would be a simple matter to encode the block indexes as
runs, for the common case where a portion of $A$ matches a series of
blocks of $B$ in order.
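
Putting the three levels together, the search on $\alpha$'s side looks
roughly like the C sketch below.  It reuses the helpers from the
earlier sketches (\texttt{checksum\_init}, \texttt{checksum\_roll},
\texttt{checksum\_value}, \texttt{tag}, \texttt{md4} and the sorted
\texttt{sums} array with its \texttt{table}), treats the short final
block of $B$ and the flushing of trailing literal data as omitted
details, and uses placeholder output routines rather than rsync's real
protocol code.

\begin{verbatim}
#include <string.h>

/* Placeholder output routines for step 5 of the algorithm:
   send literal bytes, or a reference to block `index' of B. */
void emit_literal(const unsigned char *buf, size_t len);
void emit_block(unsigned index);

static void search(const unsigned char *A, size_t alen,
                   struct sum_entry *sums, size_t nsums,
                   const int32_t table[65536], size_t S)
{
    size_t offset = 0, last_match = 0;
    uint32_t a, b;

    checksum_init(A, S, &a, &b);

    while (offset + S <= alen) {
        uint32_t weak = checksum_value(a, b);
        int32_t i = table[tag(weak)];      /* level 1: 16-bit hash table */
        int matched = 0;

        for (; i >= 0 && (size_t)i < nsums
               && tag(sums[i].weak) == tag(weak); i++) {
            unsigned char digest[16];

            if (sums[i].weak != weak)      /* level 2: 32-bit checksum */
                continue;
            md4(A + offset, S, digest);    /* level 3: strong checksum */
            if (memcmp(digest, sums[i].strong, 16) != 0)
                continue;                  /* a "false alarm" */

            /* Send the literal data since the previous match, then a
               reference to the matching block of B. */
            emit_literal(A + last_match, offset - last_match);
            emit_block(sums[i].index);

            /* Restart the search at the end of the matched block. */
            offset += S;
            last_match = offset;
            if (offset + S <= alen)
                checksum_init(A + offset, S, &a, &b);
            matched = 1;
            break;
        }

        if (!matched) {
            /* No match here: roll the checksum to the next offset. */
            if (offset + S < alen)
                checksum_roll(&a, &b, S, A[offset], A[offset + S]);
            offset++;
        }
    }
    /* (Trailing literal data from last_match to the end of A would be
       flushed here.) */
}
\end{verbatim}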

\section{Pipelining}

The above sections describe the process for constructing a copy of one
file on a remote system.  If we have several files to copy, we can
gain a considerable latency advantage by pipelining the process.

This involves $\beta$ initiating two independent processes. One of the
processes generates and sends the checksums to $\alpha$ while the
other receives the difference information from $\alpha$ and
reconstructs the files.

If the communications link is buffered then these two processes can
proceed independently and the link should be kept fully utilised in
both directions for most of the time.
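
One way to picture this arrangement is the POSIX-style sketch below,
in which $\beta$ forks a ``generator'' child that streams checksums
for every file while the parent concurrently reads $\alpha$'s
instructions and rebuilds each file.  The two helper functions are
placeholders for the per-file work described earlier, not rsync's real
interfaces.

\begin{verbatim}
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder per-file helpers (see the sketches above). */
void send_checksums_for(const char *path, int out_fd);
void receive_and_rebuild(const char *path, int in_fd);

/* Beta's side: one process streams checksums for every file while the
   other applies the differences coming back.  With a buffered link,
   neither direction has to wait for the other. */
static void pipeline(char **files, int nfiles, int to_alpha, int from_alpha)
{
    pid_t pid = fork();

    if (pid == 0) {                      /* child: the generator */
        for (int i = 0; i < nfiles; i++)
            send_checksums_for(files[i], to_alpha);
        _exit(0);
    }

    for (int i = 0; i < nfiles; i++)     /* parent: the receiver */
        receive_and_rebuild(files[i], from_alpha);
    waitpid(pid, NULL, 0);
}
\end{verbatim}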

\section{Results}

To test the algorithm, tar files were created of the Linux kernel
sources for two versions of the kernel. The two kernel versions were
1.99.10 and 2.0.0. These tar files are approximately 24 MB in size and
are separated by 5 released patch levels.

Out of the 2441 files in the 1.99.10 release, 291 files had changed in
the 2.0.0 release, 19 files had been removed and 25 files had been
added.

A ``diff'' of the two tar files using the standard GNU diff utility
produced over 32 thousand lines of output totalling 2.1 MB.

The following table shows the results for rsync between the two files
with a varying block size.\footnote{All the tests in this section were
  carried out using rsync version 0.5.}

\vspace*{5mm}
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
{\bf block} & {\bf matches} & {\bf tag}  & {\bf false}  & {\bf data} & {\bf written}  & {\bf read} \\
{\bf size}  &               & {\bf hits} & {\bf alarms} &            &                &            \\ \hline \hline

300         & 64247         & 3817434    & 948          & 5312200    & 5629158        & 1632284   \\ \hline
500         & 46989         & 620013     & 64           & 1091900    & 1283906        & 979384    \\ \hline
700         & 33255         & 571970     & 22           & 1307800    & 1444346        & 699564    \\ \hline
900         & 25686         & 525058     & 24           & 1469500    & 1575438        & 544124    \\ \hline
1100        & 20848         & 496844     & 21           & 1654500    & 1740838        & 445204    \\ \hline
\end{tabular}
\vspace*{5mm}

In each case, the CPU time taken was less than the
time it takes to run ``diff'' on the two files.\footnote{The wall
  clock time was approximately 2 minutes per run on a 50 MHz SPARC 10
  running SunOS, using rsh over loopback for communication.  GNU diff
  took about 4 minutes.}

The columns in the table are as follows:

\begin{description}
\item [block size] The size in bytes of the checksummed blocks.
\item [matches] The number of times a block of $B$ was found in $A$.
\item [tag hits] The number of times the 16-bit hash of the rolling
  checksum matched a hash of one of the checksums from $B$.
\item [false alarms] The number of times the 32-bit rolling checksum
  matched but the strong checksum didn't.
\item [data] The amount of file data transferred verbatim, in bytes.
\item [written] The total number of bytes written by $\alpha$,
  including protocol overheads. This is almost all file data.
\item [read] The total number of bytes read by $\alpha$, including
  protocol overheads. This is almost all checksum information.
\end{description}

The results demonstrate that for block sizes above 300 bytes, only a
small fraction (around 5\%) of the file was transferred. The amount
transferred was also considerably less than the size of the diff file
that would have been transferred if the diff/patch method of updating
a remote file was used.

The checksums themselves took up a considerable amount of space,
although much less than the size of the data transferred in each
case. Each pair of checksums consumes 20 bytes: 4 bytes for the
rolling checksum plus 16 bytes for the 128-bit MD4 checksum.
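As a rough check, with $S = 700$ a 24 MB tar file divides into about
$24\,000\,000 / 700 \approx 34\,000$ blocks, so the checksum data
amounts to roughly $34\,000 \times 20 \approx 680$ kB; this is
consistent with the ``read'' column of the first table, which as noted
above is almost all checksum information.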

The number of false alarms was less than $1/1000$ of the number of
true matches, indicating that the 32-bit rolling checksum is quite
good at screening out false matches.

The number of tag hits indicates that the second level of the
checksum search algorithm was invoked about once every 50
characters.  This is quite high because the total number of blocks in
the file is a large fraction of the size of the tag hash table. For
smaller files we would expect the tag hit rate to be much closer to
the number of matches.  For extremely large files, we should probably
increase the size of the hash table.

The next table shows similar results for a much smaller set of files.
In this case the files were not packed into a tar file first. Rather,
rsync was invoked with an option to recursively descend the directory
tree. The files used were from two source releases of another software
package called Samba. The total source code size is 1.7 MB and the
diff between the two releases is 4155 lines long totalling 120 kB.

\vspace*{5mm}
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
{\bf block} & {\bf matches} & {\bf tag}  & {\bf false}  & {\bf data} & {\bf written}  & {\bf read} \\
{\bf size}  &               & {\bf hits} & {\bf alarms} &            &                &            \\ \hline \hline

300         & 3727          & 3899       & 0            & 129775     & 153999         & 83948       \\ \hline
500         & 2158          & 2325       & 0            & 171574     & 189330         & 50908       \\ \hline
700         & 1517          & 1649       & 0            & 195024     & 210144         & 36828        \\ \hline
900         & 1156          & 1281       & 0            & 222847     & 236471         & 29048        \\ \hline
1100        & 921           & 1049       & 0            & 250073     & 262725         & 23988        \\ \hline
\end{tabular}
\vspace*{5mm}


\section{Availability}

An implementation of rsync which provides a convenient interface
similar to the common UNIX command rcp has been written and is
available for download from http://rsync.samba.org/.

\end{document}