Imagine that we have some file, called, for example, "A.txt". We know that it contains some duplicate elements. "A.txt" is very big, more than ten times bigger than memory, maybe around 50 GB. Sometimes the size of B will be approximately equal to the size of A, sometimes it will be many times smaller than the size of A. Let it have a structure like this:
a 1
b 2
c 445
a 1
We need to get a file "B.txt" that does not have such duplicates. As an example, it should be this:
a 1
b 2
c 445
I thought about an algorithm that copies A to B, then takes the first string in B and compares it against every other one; if it finds the same string, it deletes the duplicate. Then it takes the second string, and so on.
But I think it is way too slow. What can I use?
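For reference, that is a quadratic scheme: each string is compared against every remaining string. A toy in-memory sketch of the idea in Python (the name dedupe_naive is mine), just to show its shape:

```python
def dedupe_naive(lines):
    """The scheme from the question: take the first string, delete every
    later duplicate of it, then take the second string, and so on.
    With n lines this costs O(n^2) comparisons."""
    i = 0
    while i < len(lines):
        keep = lines[i]
        # Scan everything after position i and drop exact duplicates.
        lines[i + 1:] = [s for s in lines[i + 1:] if s != keep]
        i += 1
    return lines
```

At n = 5G lines that is on the order of 10^19 comparisons, and since the data does not fit in memory, each scan would be a full pass over the disk.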
A is not a database! No SQL, please.
Sorry, I didn't say that: sorting is OK.
It can be sorted, but what if it couldn't be sorted?
Recommended answer:
Suppose you can fit 1/k'th of the file into memory and still have room for working data structures. The whole file can be processed in k or fewer passes, as below, and this has a chance of being much faster than sorting the whole file depending on line lengths and sort-algorithm constants. Sorting averages O(n ln n) and the process below is O(k n) worst case. For example, if lines average 10 characters and there are n = 5G lines, ln(n) ~ 22.3. In addition, if your output file B is much smaller than the input file A, the process probably will take only one or two passes.
Process:
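In outline, each pass copies the already-processed leading lines of the working file O through unchanged, loads the next memoryful of distinct lines into a process buffer R (emitting each once), and drops every later duplicate of a line in R. A minimal single-pass sketch in Python under that reading; the names one_pass, m_done, and mem_lines are illustrative, not from the original answer:

```python
import os

def one_pass(o_path, m_done, mem_lines):
    """One pass over the working file O. The first m_done lines were
    deduplicated by earlier passes and are copied through unchanged; the
    next mem_lines distinct lines become process buffer R and are emitted
    once each; any later line matching a line in R is dropped."""
    R = set()                                 # process buffer R
    tmp = o_path + ".tmp"                     # the rewritten copy of O
    with open(o_path) as fin, open(tmp, "w") as fout:
        for i, line in enumerate(fin):
            if i < m_done:
                fout.write(line)              # leading lines: already unique
            elif line in R:
                pass                          # duplicate of a buffered line
            elif len(R) < mem_lines:
                R.add(line)                   # claim the line for R ...
                fout.write(line)              # ... and emit it, deduplicated
            else:
                fout.write(line)              # R is full: defer to a later pass
    os.replace(tmp, o_path)                   # the rewritten file is the new O
    return m_done + len(R)                    # M has grown by |R|
```

Starting from a copy of A, the pass is repeated until m_done stops growing; the working file is then B.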
Note that after each pass, the first M lines of O contain no duplicates, and none of those M lines are duplicated in the rest of O. Thus, at least 1/k'th of the original file is processed per pass, so processing takes at most k passes.
Update 1: Instead of repeatedly writing out and reading back in the already-processed leading lines, a separate output file P should be used, to which process buffer R is appended at the end of each pass. This cuts the amount of reading and writing by a factor of k/2 when the result file B is nearly as large as A, or by somewhat less when B is much smaller than A; but in no case does it increase the amount of I/O.
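A sketch of the complete procedure with that change, again with illustrative names (dedupe, pass_with_separate_output, and the scratch-file paths are mine): each pass appends buffer R to the separate output file P, which is never re-read, and writes only the deferred remainder to a scratch file that becomes the next pass's input:

```python
import os

def pass_with_separate_output(src, deferred, p_path, mem_lines):
    """One Update-1 pass: append up to mem_lines distinct lines (buffer R)
    to the output file P, and write every other line to a scratch file."""
    R, order = set(), []                      # R, plus first-occurrence order
    with open(src) as fin, open(deferred, "w") as fout:
        for line in fin:
            if line in R:
                continue                      # duplicate of a buffered line
            if len(R) < mem_lines:
                R.add(line)                   # claim the line for R
                order.append(line)
            else:
                fout.write(line)              # R is full: defer to a later pass
    with open(p_path, "a") as fp:
        fp.writelines(order)                  # append R to P at end of the pass
    return len(order)

def dedupe(a_path, b_path, mem_lines):
    """Multi-pass deduplication of A into B; A itself is never modified."""
    open(b_path, "w").close()                 # output file P starts empty
    scratch = [b_path + ".s0", b_path + ".s1"]
    src, i = a_path, 0
    while pass_with_separate_output(src, scratch[i], b_path, mem_lines):
        src, i = scratch[i], 1 - i            # deferred lines feed the next pass
    for s in scratch:
        if os.path.exists(s):
            os.remove(s)                      # drop the now-empty scratch files
```

A call like dedupe("A.txt", "B.txt", mem_lines) would use a mem_lines chosen so that roughly 1/k'th of the file's lines, plus the set's overhead, fits in RAM. Each pass now reads only lines that are still candidates for deduplication and writes each surviving line once, which is where the roughly k/2 saving over rewriting the leading lines every pass comes from.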