Removing duplicates from lists quickly

Datetime:2017-04-11 05:25:37         Topic: STL          Share        Original >>
Here to See The Original Article!!!

Suppose you have lists of numbers where some values are repeated (e.g., 1,1,2,3,3,3,4). You want these duplicates to be removed (e.g., 1,2,3,4). To avoid potentially expensive memory allocations, we want to solve the problem in-place, writing back the answer is the current array. This is a surprisingly common problem.

To set a reference, suppose I generate 1024 random numbers in the range [0,1024) and I sort them. This will generate a few repeated values. I want to remove them.

I use integers for my test, but we could equally work with pointers to strings or arbitrary objects.

In C++, we have an STL function for this very purpose: std::unique . On a recent Intel processor, it takes over 11 cycles per value in the array.

You might assume that this result cannot be improved much. Let us see how fast we can go.

You can gain a little bit of efficiency over STL by writing your own function:

size_t unique(uint32_t *out, size_t len) {
    if(len ==  0) return 0; 
    size_t pos = 1;
    uint32_t oldv = out[0];
    for (size_t i = 1; i < len; ++i) {
        uint32_t newv = out[i];
        if (newv != oldv) { 
            out[pos++] = newv;
        }
        oldv = newv;
    }
    return pos;
}

Somehow, this saves about one cycle per array value (we are just under 11 cycles per value). I am not sure why it seems to be a tiny bit faster.

The main benefit of writing our own function, however, is that it gives us a chance to think about the algorithm.

What hurts us in this code are the mispredicted branches that occur when I compare the new value with the previous one. Because I have few repetitions, the processor predicts that there will be none at all. When a repetition does occur, the pipeline must unwind and fix the problems caused by the mispredicted branch.

We can multiply the speed by about a factor of 4 (to less than 3 cycles per array value) with a branchless approach:

static size_t hope_unique(uint32_t *out, size_t len) {
    if(len ==  0) return 0; // duh!
    size_t pos = 1;
    uint32_t oldv = out[0];
    for (size_t i = 1; i < len; ++i) {
        uint32_t newv = out[i];
        out[pos] = newv;
        pos += (newv != oldv);
        oldv = newv;
    }
    return pos;
}

Can we do better? It turns out that using SIMD instructions (with the AVX2 instruction set), we can get to about 1 cycle per array value. In that case, the code is not only branchless, but it also operates on vectors of several values at once…

int _avx_unique_store(__m256i ov, __m256i nv, __m256i *o) {
    __m256i wpm  = _mm256_set_epi32(0,-1,-1,-1,-1,-1,-1,-1);
    __m256i wipelast = _mm256_and_si256(nv,wpm);
    __m256i ovkeeplast = _mm256_andnot_si256(wpm, ov);
    __m256i recon  = _mm256_or_si256(wipelast,ovkeeplast);
    const __m256i mbom = _mm256_set_epi32(6,5,4,3,2,1,0,7);
    __m256i vT = _mm256_permutevar8x32_epi32(recon,mbom);
    int M = _mm256_movemask_ps(_mm256_cmpeq_epi32(vT, nv));
    int N =  8 - _mm_popcnt_u64(M);
    __m256i key = _mm256_loadu_si256(uniqshuf + M);
    __m256i val =_mm256_permutevar8x32_epi32(nv,key);
    _mm256_storeu_si256(o, val);
    return N;
}

I realize that the vectorized code looks like gibberish but my goal is to assess the benefits over vectorization.

With vectorization, we are fully one order of magnitude faster than STL’s std::unique function.

As usual, my code is freely available on GitHub .








New