\documentclass[a4paper]{article}
\begin{document}


\title{The rsync algorithm}

\author{Andrew Tridgell \quad\quad Paul Mackerras\\
Department of Computer Science \\
Australian National University \\
Canberra, ACT 0200, Australia}

\maketitle

\begin{abstract}
  This report presents an algorithm for updating a file on one machine
  to be identical to a file on another machine.  We assume that the
  two machines are connected by a low-bandwidth high-latency
  bi-directional communications link.  The algorithm identifies parts
  of the source file which are identical to some part of the
  destination file, and only sends those parts which cannot be matched
  in this way.  Effectively, the algorithm computes a set of
  differences without having both files on the same machine.  The
  algorithm works best when the files are similar, but will also
  function correctly and reasonably efficiently when the files are
  quite different.
\end{abstract}

\section{The problem}

Imagine you have two files, $A$ and $B$, and you wish to update $B$ to be
the same as $A$. The obvious method is to copy $A$ onto $B$.

Now imagine that the two files are on machines connected by a slow
communications link, for example a dialup IP link.  If $A$ is large,
copying $A$ onto $B$ will be slow.  To make it faster you could
compress $A$ before sending it, but that will usually only gain a
factor of 2 to 4.

Now assume that $A$ and $B$ are quite similar, perhaps both derived
from the same original file. To really speed things up you would need
to take advantage of this similarity. A common method is to send just
the differences between $A$ and $B$ down the link and then use this
list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of
differences between two files rely on being able to read both files.
Thus they require that both files are available beforehand at one end
of the link.  If they are not both available on the same machine,
these algorithms cannot be used (once you had copied the file over,
you wouldn't need the differences).  This is the problem that rsync
addresses.

The rsync algorithm efficiently computes which parts of a source file
match some part of an existing destination file.  These parts need not
be sent across the link; all that is needed is a reference to the part
of the destination file.  Only parts of the source file which are not
matched in this way need to be sent verbatim.  The receiver can then
construct a copy of the source file using the references to parts of
the existing destination file and the verbatim material.

Trivially, the data sent to the receiver can be compressed using any
of a range of common compression algorithms, for further speed
improvements.

\section{The rsync algorithm}

Suppose we have two general-purpose computers $\alpha$ and $\beta$.
Computer $\alpha$ has access to a file $A$ and $\beta$ has access to
file $B$, where $A$ and $B$ are ``similar''.  There is a slow
communications link between $\alpha$ and $\beta$.

The rsync algorithm consists of the following steps:

\begin{enumerate}
\item $\beta$ splits the file $B$ into a series of non-overlapping
  fixed-sized blocks of size $S$ bytes\footnote{We have found that
  values of $S$ between 500 and 1000 are quite good for most purposes.}.
  The last block may be shorter than $S$ bytes.

\item For each of these blocks $\beta$ calculates two checksums:
  a weak ``rolling'' 32-bit checksum (described below) and a strong
  128-bit MD4 checksum.  (A sketch of this per-block computation is
  given after this list.)

\item $\beta$ sends these checksums to $\alpha$.

\item $\alpha$ searches through $A$ to find all blocks of length $S$
  bytes (at any offset, not just multiples of $S$) that have the same
  weak and strong checksum as one of the blocks of $B$. This can be
  done in a single pass very quickly using a special property of the
  rolling checksum described below.

\item $\alpha$ sends $\beta$ a sequence of instructions for
  constructing a copy of $A$.  Each instruction is either a reference
  to a block of $B$, or literal data.  Literal data is sent only for
  those sections of $A$ which did not match any of the blocks of $B$.
\end{enumerate}
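
As a concrete illustration of steps 1 and 2, the following C sketch
shows how $\beta$ might walk through $B$ and produce one checksum pair
per block.  It is only a sketch of the structure, not the actual rsync
implementation: the \texttt{md4()} prototype stands in for whatever
MD4 routine is available, and the weak checksum is the one defined in
the section on the rolling checksum below.

\begin{verbatim}
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE 700          /* S: block size in bytes */

/* Assumed one-shot MD4 routine (rsync carries its own MD4 code;
   any MD4 library call of this shape would do). */
void md4(const unsigned char *buf, size_t len, unsigned char digest[16]);

/* Weak 32-bit checksum s(k,l) as defined below, with M = 2^16. */
static uint32_t weak_checksum(const unsigned char *buf, size_t len)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a += buf[i];
        b += (uint32_t)(len - i) * buf[i];
    }
    return (a & 0xffff) | ((b & 0xffff) << 16);
}

/* Steps 1 and 2: split B into fixed-sized blocks and compute a weak
   and a strong checksum for each.  Step 3 would transmit the
   resulting (index, weak, strong) triples to alpha. */
static void generate_checksums(FILE *f)
{
    unsigned char block[BLOCK_SIZE];
    unsigned char digest[16];
    size_t n;
    unsigned index = 0;

    while ((n = fread(block, 1, BLOCK_SIZE, f)) > 0) {
        uint32_t weak = weak_checksum(block, n); /* last block may be short */
        md4(block, n, digest);
        printf("block %u: weak checksum %08x\n", index++, (unsigned)weak);
        /* ... send (index, weak, digest) to alpha ... */
    }
}
\end{verbatim}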

The end result is that $\beta$ gets a copy of $A$, but only the pieces
of $A$ that are not found in $B$ (plus a small amount of data for
checksums and block indexes) are sent over the link. The algorithm
also only requires one round trip, which minimises the impact of the
link latency.

The most important details of the algorithm are the rolling checksum
and the associated multi-alternate search mechanism which allows the
all-offsets checksum search to proceed very quickly. These will be
discussed in greater detail below.

\section{Rolling checksum}

The weak rolling checksum used in the rsync algorithm needs to have
the property that it is very cheap to calculate the checksum of a
buffer $X_2 \ldots X_{n+1}$ given the checksum of buffer $X_1 \ldots X_n$ and
the values of the bytes $X_1$ and $X_{n+1}$.

The weak checksum algorithm we used in our implementation was inspired
by Mark Adler's adler-32 checksum.  Our checksum is defined by
$$ a(k,l) = (\sum_{i=k}^l X_i) \bmod M $$
$$ b(k,l) = (\sum_{i=k}^l (l-i+1)X_i) \bmod M $$
$$ s(k,l) = a(k,l) + 2^{16} b(k,l) $$

where $s(k,l)$ is the rolling checksum of the bytes $X_k \ldots X_l$.
For simplicity and speed, we use $M = 2^{16}$.

The important property of this checksum is that successive values can
be computed very efficiently using the recurrence relations

$$ a(k+1,l+1) = (a(k,l) - X_k + X_{l+1}) \bmod M $$
$$ b(k+1,l+1) = (b(k,l) - (l-k+1) X_k + a(k+1,l+1)) \bmod M $$

Thus the checksum can be calculated for blocks of length $S$ at all
possible offsets within a file in a ``rolling'' fashion, with very
little computation at each point.
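
The fragment below is a minimal C sketch of these formulas (the
function and variable names are ours, not rsync's): one routine
computes $a$ and $b$ for a window from scratch, and another applies
the recurrence relations to slide the window forward by one byte.

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* Compute a(k,l) and b(k,l) for a window of `len' bytes.
   With M = 2^16 the reduction is a simple mask. */
static void checksum_init(const unsigned char *buf, size_t len,
                          uint32_t *a, uint32_t *b)
{
    uint32_t sa = 0, sb = 0;
    for (size_t i = 0; i < len; i++) {
        sa += buf[i];
        sb += (uint32_t)(len - i) * buf[i];
    }
    *a = sa & 0xffff;
    *b = sb & 0xffff;
}

/* Roll the window one byte forward: drop `out' (X_k) and take in
   `in' (X_{l+1}), using the two recurrence relations above.
   `len' is the window length, i.e. l - k + 1. */
static void checksum_roll(uint32_t *a, uint32_t *b, size_t len,
                          unsigned char out, unsigned char in)
{
    *a = (*a - out + in) & 0xffff;
    *b = (*b - (uint32_t)len * out + *a) & 0xffff;
}

/* s(k,l): a in the low 16 bits, b in the high 16 bits. */
static uint32_t checksum_value(uint32_t a, uint32_t b)
{
    return a | (b << 16);
}
\end{verbatim}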

Despite its simplicity, this checksum was found to be quite adequate as
a first-level check for a match of two file blocks.  We have found in
practice that the probability of this checksum matching when the
blocks are not equal is quite low.  This is important because the much
more expensive strong checksum must be calculated for each block where
the weak checksum matches.

\section{Checksum searching}

Once $\alpha$ has received the list of checksums of the blocks of $B$,
it must search $A$ for any blocks at any offset that match the
checksum of some block of $B$.  The basic strategy is to compute the
32-bit rolling checksum for a block of length $S$ starting at each
byte of $A$ in turn, and for each checksum, search the list for a
match.  To do this, our implementation uses a simple three-level
searching scheme.

The first level uses a 16-bit hash of the 32-bit rolling checksum and
a $2^{16}$-entry hash table.  The list of checksum values (i.e., the
checksums from the blocks of $B$) is sorted according to the 16-bit
hash of the 32-bit rolling checksum.  Each entry in the hash table
points to the first element of the list for that hash value, or
contains a null value if no element of the list has that hash value.
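
A minimal sketch of this first-level structure is given below, under
the assumption that the checksum pairs received from $\beta$ have been
read into an array.  The particular 16-bit hash shown (folding the two
halves of the 32-bit checksum together) is one plausible choice rather
than a description of rsync's exact code.

\begin{verbatim}
#include <stdint.h>
#include <stdlib.h>

struct sum_entry {
    uint32_t weak;              /* 32-bit rolling checksum of the block */
    unsigned char strong[16];   /* 128-bit MD4 checksum of the block */
    unsigned index;             /* block index within B */
};

/* 16-bit hash ("tag") of the 32-bit rolling checksum. */
static uint16_t tag(uint32_t weak)
{
    return (uint16_t)((weak & 0xffff) + (weak >> 16));
}

static int cmp_by_tag(const void *pa, const void *pb)
{
    uint16_t ta = tag(((const struct sum_entry *)pa)->weak);
    uint16_t tb = tag(((const struct sum_entry *)pb)->weak);
    return (ta > tb) - (ta < tb);
}

#define NULL_TAG (-1)

/* table[h] becomes the index of the first sorted entry whose tag is h,
   or NULL_TAG if no block of B has that tag. */
static void build_tag_table(struct sum_entry *sums, size_t nsums,
                            int32_t table[65536])
{
    qsort(sums, nsums, sizeof(sums[0]), cmp_by_tag);
    for (size_t h = 0; h < 65536; h++)
        table[h] = NULL_TAG;
    for (size_t i = nsums; i-- > 0; )       /* walk backwards so the  */
        table[tag(sums[i].weak)] = (int32_t)i; /* first of a run wins */
}
\end{verbatim}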

At each offset in the file the 32-bit rolling checksum and its 16-bit
hash are calculated.  If the hash table entry for that hash value is
not a null value, the second-level check is invoked.

The second-level check involves scanning the sorted checksum list
starting with the entry pointed to by the hash table entry, looking
for an entry whose 32-bit rolling checksum matches the current value.
The scan terminates when it reaches an entry whose 16-bit hash
differs.  If this search finds a match, the third-level check is
invoked.

The third-level check involves calculating the strong checksum for the
current offset in the file and comparing it with the strong checksum
value in the current list entry.  If the two strong checksums match,
we assume that we have found a block of $A$ which matches a block of
$B$.  In fact the blocks could be different, but the probability of
this is microscopic, and in practice this is a reasonable assumption.

When a match is found, $\alpha$ sends $\beta$ the data in $A$ between
the current file offset and the end of the previous match, followed by
the index of the block in $B$ that matched.  This data is sent as soon
as a match is found, which allows us to overlap the communication with
further computation.

If no match is found at a given offset in the file, the rolling
checksum is updated to the next offset and the search proceeds.  If a
match is found, the search is restarted at the end of the matched
block.  This strategy saves a considerable amount of computation for
the common case where the two files are nearly identical.  In
addition, it would be a simple matter to encode the block indexes as
runs, for the common case where a portion of $A$ matches a series of
blocks of $B$ in order.
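
Putting the three levels together, the search on $\alpha$'s side looks
roughly like the C sketch below.  It reuses the helpers from the
earlier sketches (\texttt{checksum\_init}, \texttt{checksum\_roll},
\texttt{checksum\_value}, \texttt{tag}, \texttt{md4} and the sorted
\texttt{sums} array with its \texttt{table}), treats the short final
block of $B$ and the flushing of trailing literal data as omitted
details, and uses placeholder output routines rather than rsync's real
protocol code.

\begin{verbatim}
#include <string.h>

/* Placeholder output routines for step 5 of the algorithm:
   send literal bytes, or a reference to block `index' of B. */
void emit_literal(const unsigned char *buf, size_t len);
void emit_block(unsigned index);

static void search(const unsigned char *A, size_t alen,
                   struct sum_entry *sums, size_t nsums,
                   const int32_t table[65536], size_t S)
{
    size_t offset = 0, last_match = 0;
    uint32_t a, b;

    checksum_init(A, S, &a, &b);

    while (offset + S <= alen) {
        uint32_t weak = checksum_value(a, b);
        int32_t i = table[tag(weak)];      /* level 1: 16-bit hash table */
        int matched = 0;

        for (; i >= 0 && (size_t)i < nsums
               && tag(sums[i].weak) == tag(weak); i++) {
            unsigned char digest[16];

            if (sums[i].weak != weak)      /* level 2: 32-bit checksum */
                continue;
            md4(A + offset, S, digest);    /* level 3: strong checksum */
            if (memcmp(digest, sums[i].strong, 16) != 0)
                continue;                  /* a "false alarm" */

            /* Send the literal data since the previous match, then a
               reference to the matching block of B. */
            emit_literal(A + last_match, offset - last_match);
            emit_block(sums[i].index);

            /* Restart the search at the end of the matched block. */
            offset += S;
            last_match = offset;
            if (offset + S <= alen)
                checksum_init(A + offset, S, &a, &b);
            matched = 1;
            break;
        }

        if (!matched) {
            /* No match here: roll the checksum to the next offset. */
            if (offset + S < alen)
                checksum_roll(&a, &b, S, A[offset], A[offset + S]);
            offset++;
        }
    }
    /* (Trailing literal data from last_match to the end of A would be
       flushed here.) */
}
\end{verbatim}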

\section{Pipelining}

The above sections describe the process for constructing a copy of one
file on a remote system.  If we have several files to copy, we can
gain a considerable latency advantage by pipelining the process.

This involves $\beta$ initiating two independent processes. One of the
processes generates and sends the checksums to $\alpha$ while the
other receives the difference information from $\alpha$ and
reconstructs the files.

If the communications link is buffered then these two processes can
proceed independently and the link should be kept fully utilised in
both directions for most of the time.
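
One way to picture this arrangement is the POSIX-style sketch below,
in which $\beta$ forks a ``generator'' child that streams checksums
for every file while the parent concurrently reads $\alpha$'s
instructions and rebuilds each file.  The two helper functions are
placeholders for the per-file work described earlier, not rsync's real
interfaces.

\begin{verbatim}
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder per-file helpers (see the sketches above). */
void send_checksums_for(const char *path, int out_fd);
void receive_and_rebuild(const char *path, int in_fd);

/* Beta's side: one process streams checksums for every file while the
   other applies the differences coming back.  With a buffered link,
   neither direction has to wait for the other. */
static void pipeline(char **files, int nfiles, int to_alpha, int from_alpha)
{
    pid_t pid = fork();

    if (pid == 0) {                      /* child: the generator */
        for (int i = 0; i < nfiles; i++)
            send_checksums_for(files[i], to_alpha);
        _exit(0);
    }

    for (int i = 0; i < nfiles; i++)     /* parent: the receiver */
        receive_and_rebuild(files[i], from_alpha);
    waitpid(pid, NULL, 0);
}
\end{verbatim}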

\section{Results}

To test the algorithm, tar files were created of the Linux kernel
sources for two versions of the kernel. The two kernel versions were
1.99.10 and 2.0.0. These tar files are approximately 24 MB in size and
are separated by 5 released patch levels.

Out of the 2441 files in the 1.99.10 release, 291 files had changed in
the 2.0.0 release, 19 files had been removed and 25 files had been
added.

A ``diff'' of the two tar files using the standard GNU diff utility
produced over 32 thousand lines of output totalling 2.1 MB.

The following table shows the results for rsync between the two files
with a varying block size.\footnote{All the tests in this section were
  carried out using rsync version 0.5.}

\vspace*{5mm}
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
{\bf block} & {\bf matches} & {\bf tag}  & {\bf false}  & {\bf data} & {\bf written}  & {\bf read} \\
{\bf size}  &               & {\bf hits} & {\bf alarms} &            &                &            \\ \hline \hline

300         & 64247         & 3817434    & 948          & 5312200    & 5629158        & 1632284   \\ \hline
500         & 46989         & 620013     & 64           & 1091900    & 1283906        & 979384    \\ \hline
700         & 33255         & 571970     & 22           & 1307800    & 1444346        & 699564    \\ \hline
900         & 25686         & 525058     & 24           & 1469500    & 1575438        & 544124    \\ \hline
1100        & 20848         & 496844     & 21           & 1654500    & 1740838        & 445204    \\ \hline
\end{tabular}
\vspace*{5mm}

In each case, the CPU time taken was less than the
time it takes to run ``diff'' on the two files.\footnote{The wall
  clock time was approximately 2 minutes per run on a 50 MHz SPARC 10
  running SunOS, using rsh over loopback for communication.  GNU diff
  took about 4 minutes.}

The columns in the table are as follows:

\begin{description}
\item [block size] The size in bytes of the checksummed blocks.
\item [matches] The number of times a block of $B$ was found in $A$.
\item [tag hits] The number of times the 16-bit hash of the rolling
  checksum matched a hash of one of the checksums from $B$.
\item [false alarms] The number of times the 32-bit rolling checksum
  matched but the strong checksum didn't.
\item [data] The amount of file data transferred verbatim, in bytes.
\item [written] The total number of bytes written by $\alpha$,
  including protocol overheads. This is almost all file data.
\item [read] The total number of bytes read by $\alpha$, including
  protocol overheads. This is almost all checksum information.
\end{description}

The results demonstrate that for block sizes above 300 bytes, only a
small fraction (around 5\%) of the file was transferred. The amount
transferred was also considerably less than the size of the diff file
that would have been transferred if the diff/patch method of updating
a remote file was used.

The checksums themselves took up a considerable amount of space,
although much less than the size of the data transferred in each
case. Each pair of checksums consumes 20 bytes: 4 bytes for the
rolling checksum plus 16 bytes for the 128-bit MD4 checksum.
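As a rough check, with $S = 700$ a 24 MB tar file divides into about
$24\,000\,000 / 700 \approx 34\,000$ blocks, so the checksum data
amounts to roughly $34\,000 \times 20 \approx 680$ kB; this is
consistent with the ``read'' column of the first table, which as noted
above is almost all checksum information.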

The number of false alarms was less than $1/1000$ of the number of
true matches, indicating that the 32-bit rolling checksum is quite
good at screening out false matches.

The number of tag hits indicates that the second level of the
checksum search algorithm was invoked about once every 50
characters.  This is quite high because the total number of blocks in
the file is a large fraction of the size of the tag hash table. For
smaller files we would expect the tag hit rate to be much closer to
the number of matches.  For extremely large files, we should probably
increase the size of the hash table.

The next table shows similar results for a much smaller set of files.
In this case the files were not packed into a tar file first. Rather,
rsync was invoked with an option to recursively descend the directory
tree. The files used were from two source releases of another software
package called Samba. The total source code size is 1.7 MB and the
diff between the two releases is 4155 lines long totalling 120 kB.

\vspace*{5mm}
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
{\bf block} & {\bf matches} & {\bf tag}  & {\bf false}  & {\bf data} & {\bf written}  & {\bf read} \\
{\bf size}  &               & {\bf hits} & {\bf alarms} &            &                &            \\ \hline \hline

300         & 3727          & 3899       & 0            & 129775     & 153999         & 83948       \\ \hline
500         & 2158          & 2325       & 0            & 171574     & 189330         & 50908       \\ \hline
700         & 1517          & 1649       & 0            & 195024     & 210144         & 36828        \\ \hline
900         & 1156          & 1281       & 0            & 222847     & 236471         & 29048        \\ \hline
1100        & 921           & 1049       & 0            & 250073     & 262725         & 23988        \\ \hline
\end{tabular}
\vspace*{5mm}


\section{Availability}

An implementation of rsync which provides a convenient interface
similar to the common UNIX command rcp has been written and is
available for download from http://rsync.samba.org/.

\end{document}