
\documentclass[a4paper]{article}
\begin{document}


\title{The rsync algorithm}

\author{Andrew Tridgell \quad\quad Paul Mackerras\\
Department of Computer Science \\
Australian National University \\
Canberra, ACT 0200, Australia}

\maketitle

\begin{abstract}
  This report presents an algorithm for updating a file on one machine
  to be identical to a file on another machine.  We assume that the
  two machines are connected by a low-bandwidth high-latency
  bi-directional communications link.  The algorithm identifies parts
  of the source file which are identical to some part of the
  destination file, and only sends those parts which cannot be matched
  in this way.  Effectively, the algorithm computes a set of
  differences without having both files on the same machine.  The
  algorithm works best when the files are similar, but will also
  function correctly and reasonably efficiently when the files are
  quite different.
\end{abstract}

\section{The problem}

Imagine you have two files, $A$ and $B$, and you wish to update $B$ to be
the same as $A$. The obvious method is to copy $A$ onto $B$.

Now imagine that the two files are on machines connected by a slow
communications link, for example a dialup IP link.  If $A$ is large,
copying $A$ onto $B$ will be slow.  To make it faster you could
compress $A$ before sending it, but that will usually only gain a
factor of 2 to 4.

Now assume that $A$ and $B$ are quite similar, perhaps both derived
from the same original file. To really speed things up you would need
to take advantage of this similarity. A common method is to send just
the differences between $A$ and $B$ down the link and then use this
list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of
differences between two files rely on being able to read both files.
Thus they require that both files are available beforehand at one end
of the link.  If they are not both available on the same machine,
these algorithms cannot be used (once you had copied the file over,
you wouldn't need the differences).  This is the problem that rsync
addresses.

The rsync algorithm efficiently computes which parts of a source file
match some part of an existing destination file.  These parts need not
be sent across the link; all that is needed is a reference to the part
of the destination file.  Only parts of the source file which are not
matched in this way need to be sent verbatim.  The receiver can then
construct a copy of the source file using the references to parts of
the existing destination file and the verbatim material.

Trivially, the data sent to the receiver can be compressed using any
of a range of common compression algorithms, for further speed
improvements.
\section{The rsync algorithm}

Suppose we have two general-purpose computers $\alpha$ and $\beta$.
Computer $\alpha$ has access to a file $A$ and $\beta$ has access to
file $B$, where $A$ and $B$ are ``similar''.  There is a slow
communications link between $\alpha$ and $\beta$.

The rsync algorithm consists of the following steps:

\begin{enumerate}
\item $\beta$ splits the file $B$ into a series of non-overlapping
  fixed-sized blocks of size $S$ bytes\footnote{We have found that
  values of $S$ between 500 and 1000 are quite good for most
  purposes.}.  The last block may be shorter than $S$ bytes.

\item For each of these blocks $\beta$ calculates two checksums:
  a weak ``rolling'' 32-bit checksum (described below) and a strong
  128-bit MD4 checksum.

\item $\beta$ sends these checksums to $\alpha$.

\item $\alpha$ searches through $A$ to find all blocks of length $S$
  bytes (at any offset, not just multiples of $S$) that have the same
  weak and strong checksum as one of the blocks of $B$. This can be
  done in a single pass very quickly using a special property of the
  rolling checksum described below.

\item $\alpha$ sends $\beta$ a sequence of instructions for
  constructing a copy of $A$.  Each instruction is either a reference
  to a block of $B$, or literal data.  Literal data is sent only for
  those sections of $A$ which did not match any of the blocks of $B$.
\end{enumerate}
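
Steps 1 and 2 on $\beta$ can be sketched in Python. This is an
illustration, not the rsync source: the weak checksum follows the
definition given in the next section, \texttt{hashlib.md5} stands in
for the paper's MD4 strong checksum (MD4 is only exposed by
\texttt{hashlib} on some OpenSSL builds), and the default block size
of 700 bytes is simply one value from the recommended range.

```python
import hashlib

M = 1 << 16  # modulus for the weak checksum, M = 2^16

def weak_checksum(block):
    # s = a + 2^16 * b, per the definitions in the next section:
    # a is the byte sum mod M; b weights each byte by its distance
    # from the end of the block (the first byte gets weight len(block)).
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a + (b << 16)

def block_checksums(data, S=700):
    # Step 1: non-overlapping S-byte blocks (the last may be shorter).
    # Step 2: one (weak, strong) checksum pair per block.  MD5 is an
    # illustrative stand-in for the paper's MD4.
    return [(weak_checksum(data[off:off + S]),
             hashlib.md5(data[off:off + S]).digest())
            for off in range(0, len(data), S)]
```

The resulting list of pairs is what $\beta$ sends to $\alpha$ in step 3.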

The end result is that $\beta$ gets a copy of $A$, but only the pieces
of $A$ that are not found in $B$ (plus a small amount of data for
checksums and block indexes) are sent over the link. The algorithm
also only requires one round trip, which minimises the impact of the
link latency.

The most important details of the algorithm are the rolling checksum
and the associated multi-alternate search mechanism which allows the
all-offsets checksum search to proceed very quickly. These will be
discussed in greater detail below.

\section{Rolling checksum}

The weak rolling checksum used in the rsync algorithm needs to have
the property that it is very cheap to calculate the checksum of a
buffer $X_2 \ldots X_{n+1}$ given the checksum of buffer $X_1 \ldots
X_n$ and the values of the bytes $X_1$ and $X_{n+1}$.

The weak checksum algorithm we used in our implementation was inspired
by Mark Adler's adler-32 checksum.  Our checksum is defined by
$$ a(k,l) = (\sum_{i=k}^l X_i) \bmod M $$
$$ b(k,l) = (\sum_{i=k}^l (l-i+1)X_i) \bmod M $$
$$ s(k,l) = a(k,l) + 2^{16} b(k,l) $$

where $s(k,l)$ is the rolling checksum of the bytes $X_k \ldots X_l$.
For simplicity and speed, we use $M = 2^{16}$.

The important property of this checksum is that successive values can
be computed very efficiently using the recurrence relations

$$ a(k+1,l+1) = (a(k,l) - X_k + X_{l+1}) \bmod M $$
$$ b(k+1,l+1) = (b(k,l) - (l-k+1) X_k + a(k+1,l+1)) \bmod M $$

Thus the checksum can be calculated for blocks of length $S$ at all
possible offsets within a file in a ``rolling'' fashion, with very
little computation at each point.

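The recurrences can be checked against the direct definition with a
short Python sketch (illustrative; the function names are invented
here):

```python
M = 1 << 16

def weak_ab(block):
    # Direct evaluation of a and b over one block, per the definitions
    # above (the 1-based window X_k..X_l becomes a 0-based slice here).
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def rolling_checksums(data, S):
    # Yield s(k, k+S-1) for every offset k, updating a and b with
    #   a' = (a - X_k + X_{l+1}) mod M
    #   b' = (b - (l-k+1) X_k + a') mod M,   where l-k+1 = S.
    a, b = weak_ab(data[:S])
    for k in range(len(data) - S + 1):
        yield a + (b << 16)
        if k + S < len(data):
            a_next = (a - data[k] + data[k + S]) % M
            b = (b - S * data[k] + a_next) % M
            a = a_next
```

Each update costs a handful of arithmetic operations, independent of
$S$, which is what makes the all-offsets search below affordable.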
Despite its simplicity, this checksum was found to be quite adequate as
a first-level check for a match of two file blocks.  We have found in
practice that the probability of this checksum matching when the
blocks are not equal is quite low.  This is important because the much
more expensive strong checksum must be calculated for each block where
the weak checksum matches.

\section{Checksum searching}

Once $\alpha$ has received the list of checksums of the blocks of $B$,
it must search $A$ for any blocks at any offset that match the
checksum of some block of $B$.  The basic strategy is to compute the
32-bit rolling checksum for a block of length $S$ starting at each
byte of $A$ in turn, and for each checksum, search the list for a
match.  To do this our implementation uses a simple three-level
searching scheme.

The first level uses a 16-bit hash of the 32-bit rolling checksum and
a $2^{16}$ entry hash table.  The list of checksum values (i.e., the
checksums from the blocks of $B$) is sorted according to the 16-bit
hash of the 32-bit rolling checksum.  Each entry in the hash table
points to the first element of the list for that hash value, or
contains a null value if no element of the list has that hash value.

At each offset in the file the 32-bit rolling checksum and its 16-bit
hash are calculated.  If the hash table entry for that hash value is
not a null value, the second-level check is invoked.

The second-level check involves scanning the sorted checksum list
starting with the entry pointed to by the hash table entry, looking
for an entry whose 32-bit rolling checksum matches the current value.
The scan terminates when it reaches an entry whose 16-bit hash
differs.  If this search finds a match, the third-level check is
invoked.

The third-level check involves calculating the strong checksum for the
current offset in the file and comparing it with the strong checksum
value in the current list entry.  If the two strong checksums match,
we assume that we have found a block of $A$ which matches a block of
$B$.  In fact the blocks could be different, but the probability of
this is microscopic, and in practice this is a reasonable assumption.

When a match is found, $\alpha$ sends $\beta$ the data in $A$ between
the end of the previous match and the current file offset, followed by
the index of the block in $B$ that matched.  This data is sent as soon
as a match is found, which allows us to overlap the communication with
further computation.

If no match is found at a given offset in the file, the rolling
checksum is updated to the next offset and the search proceeds.  If a
match is found, the search is restarted at the end of the matched
block.  This strategy saves a considerable amount of computation for
the common case where the two files are nearly identical.  In
addition, it would be a simple matter to encode the block indexes as
runs, for the common case where a portion of $A$ matches a series of
blocks of $B$ in order.

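The search on $\alpha$ can be sketched in Python.  Several
simplifications are made for brevity: a dictionary keyed by the 16-bit
hash replaces the sorted list plus $2^{16}$-entry table, the hash
function shown is an arbitrary illustrative choice (the report does
not fix one), MD5 stands in for MD4, and the weak checksum is
recomputed at every offset instead of being rolled.

```python
import hashlib

M = 1 << 16

def weak_checksum(block):
    # Weak checksum s = a + 2^16 * b as defined in the previous section.
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a + (b << 16)

def tag16(s):
    # Illustrative 16-bit hash of the 32-bit weak checksum.
    return (s ^ (s >> 16)) & 0xFFFF

def find_matches(a_data, b_sums, S):
    # b_sums: (weak, strong) pairs for the blocks of B, in block order.
    # Returns alpha's instruction stream: literal bytes objects
    # interleaved with ('block', i) references.
    table = {}
    for i, (w, strong) in enumerate(b_sums):       # level 1: hash table
        table.setdefault(tag16(w), []).append((w, strong, i))
    out, lit_start, k = [], 0, 0
    while k + S <= len(a_data):
        block = a_data[k:k + S]
        w = weak_checksum(block)    # a real implementation rolls this
        match = None
        for w2, strong, i in table.get(tag16(w), ()):
            if w2 == w:                            # level 2: weak sum
                if hashlib.md5(block).digest() == strong:  # level 3
                    match = i
                    break
        if match is None:
            k += 1                  # slide the window one byte
        else:
            if lit_start < k:       # flush unmatched bytes as literal data
                out.append(a_data[lit_start:k])
            out.append(('block', match))
            k += S                  # restart after the matched block
            lit_start = k
    if lit_start < len(a_data):
        out.append(a_data[lit_start:])
    return out
```

$\beta$ reconstructs the file by copying each literal verbatim and
replacing each \texttt{('block', i)} reference with block $i$ of $B$.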
\section{Pipelining}

The above sections describe the process for constructing a copy of one
file on a remote system.  If we have several files to copy, we can
gain a considerable latency advantage by pipelining the process.

This involves $\beta$ initiating two independent processes.  One of the
processes generates and sends the checksums to $\alpha$ while the
other receives the difference information from $\alpha$ and
reconstructs the files.

If the communications link is buffered then these two processes can
proceed independently and the link should be kept fully utilised in
both directions for most of the time.

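A toy Python sketch of the idea, using threads for $\beta$'s two
processes and queues for the buffered link; the function names and
callback interface are invented purely for illustration:

```python
import queue
import threading

def pipelined_update(files, make_sums, compute_delta, apply_delta):
    # beta runs two independent workers: a generator streaming per-file
    # checksum lists toward alpha, and a reconstructor applying the
    # deltas alpha sends back.  The queues model the buffered link, so
    # the generator never has to wait for the reconstructor.
    to_alpha, from_alpha = queue.Queue(), queue.Queue()
    results = {}

    def generator():                      # beta, process 1
        for name in files:
            to_alpha.put((name, make_sums(name)))
        to_alpha.put(None)                # end of file list

    def alpha():                          # remote side (stub)
        while (item := to_alpha.get()) is not None:
            name, sums = item
            from_alpha.put((name, compute_delta(name, sums)))
        from_alpha.put(None)

    def reconstructor():                  # beta, process 2
        while (item := from_alpha.get()) is not None:
            name, delta = item
            results[name] = apply_delta(name, delta)

    workers = [threading.Thread(target=f)
               for f in (generator, alpha, reconstructor)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

With a buffered link, the checksum stream for later files overlaps the
delta traffic for earlier ones, so each file does not pay a separate
round-trip.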
\section{Results}

To test the algorithm, tar files were created of the Linux kernel
sources for two versions of the kernel. The two kernel versions were
1.99.10 and 2.0.0. These tar files are approximately 24 MB in size and
are separated by 5 released patch levels.

Out of the 2441 files in the 1.99.10 release, 291 files had changed in
the 2.0.0 release, 19 files had been removed and 25 files had been
added.

A ``diff'' of the two tar files using the standard GNU diff utility
produced over 32 thousand lines of output totalling 2.1 MB.

The following table shows the results for rsync between the two files
with a varying block size.\footnote{All the tests in this section were
  carried out using rsync version 0.5.}

\vspace*{5mm}
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
{\bf block} & {\bf matches} & {\bf tag}  & {\bf false}  & {\bf data} & {\bf written}  & {\bf read} \\
{\bf size}  &               & {\bf hits} & {\bf alarms} &            &                &            \\ \hline \hline

300         & 64247         & 3817434    & 948          & 5312200    & 5629158        & 1632284   \\ \hline
500         & 46989         & 620013     & 64           & 1091900    & 1283906        & 979384    \\ \hline
700         & 33255         & 571970     & 22           & 1307800    & 1444346        & 699564    \\ \hline
900         & 25686         & 525058     & 24           & 1469500    & 1575438        & 544124    \\ \hline
1100        & 20848         & 496844     & 21           & 1654500    & 1740838        & 445204    \\ \hline
\end{tabular}
\vspace*{5mm}

In each case, the CPU time taken was less than the
time it takes to run ``diff'' on the two files.\footnote{The wall
  clock time was approximately 2 minutes per run on a 50 MHz SPARC 10
  running SunOS, using rsh over loopback for communication.  GNU diff
  took about 4 minutes.}

The columns in the table are as follows:

\begin{description}
\item [block size] The size in bytes of the checksummed blocks.
\item [matches] The number of times a block of $B$ was found in $A$.
\item [tag hits] The number of times the 16-bit hash of the rolling
  checksum matched a hash of one of the checksums from $B$.
\item [false alarms] The number of times the 32-bit rolling checksum
  matched but the strong checksum didn't.
\item [data] The amount of file data transferred verbatim, in bytes.
\item [written] The total number of bytes written by $\alpha$,
  including protocol overheads. This is almost all file data.
\item [read] The total number of bytes read by $\alpha$, including
  protocol overheads. This is almost all checksum information.
\end{description}

The results demonstrate that for block sizes above 300 bytes, only a
small fraction (around 5\%) of the file was transferred. The amount
transferred was also considerably less than the size of the diff file
that would have been transferred if the diff/patch method of updating
a remote file had been used.

The checksums themselves took up a considerable amount of space,
although much less than the size of the data transferred in each
case. Each pair of checksums consumes 20 bytes: 4 bytes for the
rolling checksum plus 16 bytes for the 128-bit MD4 checksum.

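That overhead is straightforward to estimate from the 20 bytes per
block; a quick illustrative sketch:

```python
def checksum_bytes(file_size, block_size, per_block=20):
    # One 20-byte (weak, strong) checksum pair per block, including the
    # final short block when the file size is not a multiple of S.
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * per_block
```

For a 24 MB tar file with $S = 500$ this gives roughly $48{,}000
\times 20 = 960{,}000$ bytes, consistent with the 979,384 bytes in the
``read'' column of the first table (which also includes protocol
overhead).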
The number of false alarms was less than $1/1000$ of the number of
true matches, indicating that the 32-bit rolling checksum is quite
good at screening out false matches.

The number of tag hits indicates that the second level of the
checksum search algorithm was invoked about once every 50
characters.  This is quite high because the total number of blocks in
the file is a large fraction of the size of the tag hash table. For
smaller files we would expect the tag hit rate to be much closer to
the number of matches.  For extremely large files, we should probably
increase the size of the hash table.

The next table shows similar results for a much smaller set of files.
In this case the files were not packed into a tar file first. Rather,
rsync was invoked with an option to recursively descend the directory
tree. The files used were from two source releases of another software
package called Samba. The total source code size is 1.7 MB and the
diff between the two releases is 4155 lines long, totalling 120 kB.

\vspace*{5mm}
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
{\bf block} & {\bf matches} & {\bf tag}  & {\bf false}  & {\bf data} & {\bf written}  & {\bf read} \\
{\bf size}  &               & {\bf hits} & {\bf alarms} &            &                &            \\ \hline \hline

300         & 3727          & 3899       & 0            & 129775     & 153999         & 83948       \\ \hline
500         & 2158          & 2325       & 0            & 171574     & 189330         & 50908       \\ \hline
700         & 1517          & 1649       & 0            & 195024     & 210144         & 36828        \\ \hline
900         & 1156          & 1281       & 0            & 222847     & 236471         & 29048        \\ \hline
1100        & 921           & 1049       & 0            & 250073     & 262725         & 23988        \\ \hline
\end{tabular}
\vspace*{5mm}


\section{Availability}

An implementation of rsync which provides a convenient interface
similar to the common UNIX command rcp has been written and is
available for download from http://rsync.samba.org/.

\end{document}
