This implementation is based on this paper by Athicha Muthitacharoen et. al (A Low-bandwidth Network File System). The main goal of the algorithm is to slice a file into different chunks so that changes to parts of files affect as less chunks as possible. For example, with a fixed chunk size, inserting one character at the beginning of a file will change all the chunks. With this algorithm, however, very often only the first chunk is detected as changed while remaining chunks are unchanged. A rolling Rabin signature of 48 bytes is used to detect boundaries of chunks.
class Program { static void Main(string[] args) { Compare("46M.txt", "46M 2.txt", 0x1FFF); } static void Compare(string file1, string file2, ulong boundary) { string A = ""; string B = ""; using (StreamReader r = new StreamReader(file1)) { A = r.ReadToEnd(); } using (StreamReader r = new StreamReader(file2)) { B = r.ReadToEnd(); } Stopwatch watch = new Stopwatch(); watch.Start(); var listA = Slice(A, boundary); watch.Stop(); long timeA = watch.ElapsedMilliseconds; watch.Reset(); watch.Start(); var listB = Slice(B, boundary); watch.Stop(); long timeB = watch.ElapsedMilliseconds; int i = 0; Console.WriteLine(string.Format("Chunks in file '{0}'\t: {1}", new FileInfo(file1).Name, listA.Count)); Console.WriteLine(string.Format("Chunks in file '{0}'\t: {1}", new FileInfo(file2).Name, listB.Count)); Console.WriteLine("\n--------MATCHED CHUNKS--------\n"); int matchCount = 0; foreach (string s in listA) { if (listB.IndexOf(s) >= 0) { Console.WriteLine("{0:D4}<-->{1:D4} (size:{2:D8} bytes)", i, listB.IndexOf(s), s.Length); matchCount++; } else Console.WriteLine("{0:D4} * (size:{1:D8} bytes)", i, s.Length); i++; } Console.WriteLine("\n------------------------------\n"); Console.WriteLine("Matched blockes {0} (out of {1})", matchCount, listA.Count); Console.WriteLine("Time used to process file '{0}'\t:{1:f3} (sec)", new FileInfo(file1).Name, timeA / 1000.0); Console.WriteLine("Time used to process file '{0}'\t:{1:f3} (sec)", new FileInfo(file2).Name, timeB / 1000.0); } static List<string> Slice(string s, ulong boundary) { List<string> ret = new List<string>(); ulong Q = 100007; //Use a much large prime number in your code! ulong D = 256; ulong pow = 1; int windowSize = 48; for (int k = 1; k < windowSize; k++) pow = (pow * D) % Q; ulong sig = 0; int lastIndex = 0; for (int i = 0; i < windowSize; i++) sig = (sig * D + (ulong)s[i]) % Q; for (int j = 1; j <= s.Length - windowSize; j++) { sig = (sig + Q - pow * (ulong)s[j - 1] % Q) % Q; sig = (sig * D + (ulong)s[j + windowSize -1 ]) % Q; if ((sig & boundary) == 0) { if (j + 1 - lastIndex >= 2048) { ret.Add(s.Substring(lastIndex, j + 1 - lastIndex)); lastIndex = j + 1; } } else if (j + 1 - lastIndex >= 65536) { ret.Add(s.Substring(lastIndex, j + 1 - lastIndex)); lastIndex = j + 1; } } if (lastIndex < s.Length - 1) ret.Add(s.Substring(lastIndex)); return ret; } }
The core algorithm implementation is in Slice() method. To test the code, you’ll need a large text file (in my case I used a file named “46M.txt”, which has, well, 46M in size). Make a copy of the file and insert some new texts randomly at different locations and name the new file “46M 2.txt”. The code generated the following outputs with my test data:
As you can see the code successfully matched 1962 out of 1968 chunks despite of the random changes made in the second file. Because the code reads all contents into memory, it doesn't handle very large files. However extending the code to directly work off a stream should be an easy task for interested readers.
0 comments:
Post a Comment