QReadWriteLock gets faster in Qt 5.7

Datetime:2016-08-23 04:20:23          Topic: Qt           Share

In Qt 5.0 already, QMutex got revamped to be fast. In the non contended case, locking and unlocking is basically only a simple atomic instruction and it does not allocate memory making it really light. QReadWriteLock however did not get the same optimizations. Now in Qt 5.7, it gets on par with QMutex.

QReadWriteLock

QReadWriteLock 's purpose is about the same as the one of QMutex : to protect a critical section. The difference is that QReadWriteLock proposes two different locking modes: read or write. This allows several readers to access the critical section at the same time, and therefore be potentially more efficient than a QMutex. At least that was the intention. But the problem is that, before Qt 5.7, QReadWriteLock's implementation was not as optimized as QMutex. In fact, QReadWriteLock was internally locking and unlocking a QMutex on every call (read or write). So QReadWriteLock was in fact slower than QMutex, unless the read section was held for a very long time, under contention.

For example, the internals of QMetaType were using a QReadWriteLock for the QMetaType database. This makes sense because that database is accessed very often for reading (every time one creates or operates on a QVariant) and very seldom accessed for writing (only when you need to register a new type the first time it is used). However, the QReadWriteLock locking (for read) was so slow that it took a significant amount of time in some QML application that use lots of QVariants, for example with Qt3D.

It was even proposed to replace QReadWriteLock by a QMutex within QMetaType . This would have saved 40% of the time of the QVariant creation. This was not necessary because I improved QReadWriteLock in Qt 5.7 to make it at least as fast as QMutex.

QMutex

QMutex is itself quite efficient already. I described the internals of QMutex in a previous article . Here is a reminder on the important aspects on QMutex:

  • sizeof(QMutex) == sizeof(void*) , without any additional allocation.
  • The non-contended case is basically only an atomic operation for lock or unlock
  • In case we need to block, fallback to pthread or native locking primitives

QReadWriteLock in Qt 5.7

I optimized QReadWriteLock to bring it on par with QMutex, using the the same implementation principle.

QReadWriteLock only has one member: a QAtomicPointer named d_ptr . Depending on the value of d_ptr , the lock is in the following state:

  • When d_ptr == 0x0 (all the bits are 0): unlocked and non-recursive. No readers or writers holding nor waiting on it.
  • When d_ptr & 0x1 (the least significant bit is set): one or several readers are currently holding the lock. No writers are waiting and the lock is non-recursive. The amount of readers is (d_ptr >> 4) + 1 .
  • When d_ptr == 0x2 : we are locked for write and nobody is waiting.
  • In any other case, when the two least significant bits are 0 but the remaining bits contain data, d_ptr points to a QReadWriteLockPrivate object which either means that the lock is recursive, or that it is currently locked and threads are possibly waiting. The QReadWriteLockPrivate has a condition variable allowing to block or wake threads.

In other words, the two least significant bits encode the state. When it is a pointer to a QReadWriteLock, those two bits will always be 0 since pointers must be properly aligned to 32 or 64 bits addresses for the CPU to use them.

This table recap the state depending on the two least significant bits:

if d_ptr is fully 0, then the lock is unlocked .
00 If d_ptr is not fully 0, it is pointer to a QReadWriteLockPrivate .
01 One or several readers are currently holding the lock. The amount of reader is (d_ptr >> 4) + 1 .
10 One writer is holding the lock and nobody is waiting

We therefore define a few constants to help us read the code.

enum {
    StateMask = 0x3,
    StateLockedForRead = 0x1,
    StateLockedForWrite = 0x2,
};
const auto dummyLockedForRead = reinterpret_cast<QReadWriteLockPrivate *>(quintptr(StateLockedForRead));
const auto dummyLockedForWrite = reinterpret_cast<QReadWriteLockPrivate *>(quintptr(StateLockedForWrite));
inline bool isUncontendedLocked(const QReadWriteLockPrivate *d)
{ return quintptr(d) & StateMask; }

Aside: The code assumes that the null pointer value is equal to binary 0, which is not guaranteed by the C++ standard, but holds true on every supported platform.

lockForRead

The really fast case happens when there is no contention. If we can atomically swap from 0 to StateLockedForRead , we have the lock and there is nothing to do. If there already are readers, we need to increase the reader count, atomically. If a writer already holds the lock, then we need to block. In order to block, we will assign a QReadWriteLockPrivate and wait on its condition variable. We call QReadWriteLockPrivate::allocate() which will pop an unused QReadWriteLockPrivate from a lock-free stack (or allocates a new one if the stack is empty). Indeed, we can never free any of the QReadWriteLockPrivate as another thread might still hold pointer to it and de-reference it. So when we release a QReadWriteLockPrivate, we put it in a lock-free stack.

lockForRead actually calls tryLockForRead(-1) , passing -1 as the timeout means "wait forever until we get the lock".

Here is the slightly edited code. (original)

Hover over the code to see fancy tool tips powered by the Woboq Code Browser !

bool QReadWriteLock::tryLockForRead(int timeout)
{
    // Fast case: non contended:
    QReadWriteLockPrivate *d;
    if (d_ptr.testAndSetAcquire(nullptr, dummyLockedForRead, d))
        return true;

    while (true) {
        if (d == 0) {
            if (!d_ptr.testAndSetAcquire(nullptr, dummyLockedForRead, d))
                continue;
            return true;
        }

        if ((quintptr(d) & StateMask) == StateLockedForRead) {
            // locked for read, increase the counter
            const auto val = reinterpret_cast<QReadWriteLockPrivate *>(quintptr(d) + (1U<<4));
            if (!d_ptr.testAndSetAcquire(d, val, d))
                continue;
            return true;
        }

        if (d == dummyLockedForWrite) {
            if (!timeout)
                return false;

            // locked for write, assign a d_ptr and wait.
            auto val = QReadWriteLockPrivate::allocate();
            val->writerCount = 1;
            if (!d_ptr.testAndSetOrdered(d, val, d)) {
                val->writerCount = 0;
                val->release();
                continue;
            }
            d = val;
        }
        Q_ASSERT(!isUncontendedLocked(d));
        // d is an actual pointer;

        if (d->recursive)
            return d->recursiveLockForRead(timeout);

        QMutexLocker lock(&d->mutex);
        if (d != d_ptr.load()) {
            // d_ptr has changed: this QReadWriteLock was unlocked before we had
            // time to lock d->mutex.
            // We are holding a lock to a mutex within a QReadWriteLockPrivate
            // that is already released (or even is already re-used). That's ok
            // because the QFreeList never frees them.
            // Just unlock d->mutex (at the end of the scope) and retry.
            d = d_ptr.loadAcquire();
            continue;
        }
        return d->lockForRead(timeout);
    }
}

lockForWrite

Exactly the same principle, as lockForRead but we would also block if there are readers holding the lock.

bool QReadWriteLock::tryLockForWrite(int timeout)
{
    // Fast case: non contended:
    QReadWriteLockPrivate *d;
    if (d_ptr.testAndSetAcquire(nullptr, dummyLockedForWrite, d))
        return true;

    while (true) {
        if (d == 0) {
            if (!d_ptr.testAndSetAcquire(d, dummyLockedForWrite, d))
                continue;
            return true;
        }

        if (isUncontendedLocked(d)) {
            if (!timeout)
                return false;

            // locked for either read or write, assign a d_ptr and wait.
            auto val = QReadWriteLockPrivate::allocate();
            if (d == dummyLockedForWrite)
                val->writerCount = 1;
            else
                val->readerCount = (quintptr(d) >> 4) + 1;
            if (!d_ptr.testAndSetOrdered(d, val, d)) {
                val->writerCount = val->readerCount = 0;
                val->release();
                continue;
            }
            d = val;
        }
        Q_ASSERT(!isUncontendedLocked(d));
        // d is an actual pointer;

        if (d->recursive)
            return d->recursiveLockForWrite(timeout);

        QMutexLocker lock(&d->mutex);
        if (d != d_ptr.load()) {
            // The mutex was unlocked before we had time to lock the mutex.
            // We are holding to a mutex within a QReadWriteLockPrivate that is already released
            // (or even is already re-used) but that's ok because the QFreeList never frees them.
            d = d_ptr.loadAcquire();
            continue;
        }
        return d->lockForWrite(timeout);
    }
}

unlock

The API has a single unlock for both read and write so we don't know if we are unlocking from reading or writing. Fortunately, we can know that with the state encoded in the lower bits. If we were locked for read, we need to decrease the reader count, or set the state to 0x0 if we are the last one. If we were locked for write we need to set the state to 0x0 . If there is a QReadWriteLockPrivate , we need to update the data there, and possibly wake up the blocked threads.

void QReadWriteLock::unlock()
{
    QReadWriteLockPrivate *d = d_ptr.load();
    while (true) {
        Q_ASSERT_X(d, "QReadWriteLock::unlock()", "Cannot unlock an unlocked lock");

        // Fast case: no contention: (no waiters, no other readers)
        if (quintptr(d) <= 2) { // 1 or 2 (StateLockedForRead or StateLockedForWrite)
            if (!d_ptr.testAndSetRelease(d, nullptr, d))
                continue;
            return;
        }

        if ((quintptr(d) & StateMask) == StateLockedForRead) {
            Q_ASSERT(quintptr(d) > (1U<<4)); //otherwise that would be the fast case
            // Just decrease the reader's count.
            auto val = reinterpret_cast<QReadWriteLockPrivate *>(quintptr(d) - (1U<<4));
            if (!d_ptr.testAndSetRelease(d, val, d))
                continue;
            return;
        }

        Q_ASSERT(!isUncontendedLocked(d));

        if (d->recursive) {
            d->recursiveUnlock();
            return;
        }

        QMutexLocker locker(&d->mutex);
        if (d->writerCount) {
            Q_ASSERT(d->writerCount == 1);
            Q_ASSERT(d->readerCount == 0);
            d->writerCount = 0;
        } else {
            Q_ASSERT(d->readerCount > 0);
            d->readerCount--;
            if (d->readerCount > 0)
                return;
        }

        if (d->waitingReaders || d->waitingWriters) {
            d->unlock();
        } else {
            Q_ASSERT(d_ptr.load() == d); // should not change when we still hold the mutex
            d_ptr.storeRelease(nullptr);
            d->release();
        }
        return;
    }
}

Benchmarks

Here is the benchmark that was run: https://codereview.qt-project.org/167113/ . The benchmark was run with Qt 5.6.1, GCC 6.1.1. What I call Qt 5.7 bellow is in fact Qt 5.6 + the QReadWriteLock patch so it only compares this patch.

Uncontended

This benchmark compares different types of lock by having a single thread running in a loop 1000000 times, locking and unlocking the mutex and doing nothing else.

QReadWriteLock (Qt 5.6) 38 ms ███████████████████
QReadWriteLock (Qt 5.7) 18 ms █████████
QMutex 16 ms ████████
std::mutex 18 ms █████████
std::shared_timed_mutex 33 ms ████████████████▌

Contented Reads

This benchmark runs as much threads as logical cores (4 in my cases). Each thread will lock and unlock the same mutex 1000000 times. We do a small amount of work inside and outside the lock. If no other work was done at all and the threads were only locking and unlocking, we would have a huge pressure on the mutex but this would not be a fair benchmark. So this benchmark does a hash lookup inside the lock and a string allocation outside of the lock. The more work is done inside the lock, the more we disadvantage QMutex compared to QReadWriteLock because threads would be blocked for longer time.

QReadWriteLock (Qt 5.6) 812 ms ████████████████████▍
QReadWriteLock (Qt 5.7) 285 ms ███████▏
QMutex 398 ms ██████████
std::mutex 489 ms ████████████▎
std::shared_timed_mutex 811 ms ████████████████████▎

Futex Version

On platforms that have futexes , QMutex does not even need a QMutexPrivate, it uses the futexes to hold the lock. Similarly, we could do the same with QReadWriteLock. I made an implementation of QReadWriteLock using futex (in fact I made it first before the generic version). But it is not in Qt 5.7 and is not yet merged in Qt, perhaps for a future version if I get the motivation and time to get it merged.

Could we get even faster?

As always, nothing is perfect and there is always still room for improvement. A flaw of this implementation is that all the readers still need to perform an atomic write at the same memory location (in order to increment the reader's count). This causes contention if there are many reader threads. For cache performance, we would no want that the readers write to the same memory location. Such implementations are possible and would make the contended case faster, but then would take more memory and might be slower for the non-contended case.

Conclusion

These benchmarks shows the huge improvement in QReadWriteLock in Qt 5.7. The Qt classes have nothing to envy to their libstdc++ implementation. std::shared_timed_mutex which would be the standard equivalent of a QReadWriteLock is surprisingly slow. (I heard rumors that it might get better.)

It is optimized for the usual case of Qt with relatively low contention. It is taking a very small amount of memory and makes it a pretty decent implementation of a read write lock.

In summary, you can now use QReadWriteLock as soon as there are many reads and seldom writes. This is only about non recursive mutex. Recursive mutex are always slower and should be avoided. Not only because they are slower, but it is also harder to reason about them.





About List