
RadarGun

Benchmarking framework for data grids and distributed caches

Failover tests

This page describes the operations executed by BackgroundOperationsManager, started and stopped using the BackgroundStressors(Start|Stop)Stage stages, and checked using BackgroundStressorsCheckStage. Other stages used during failover testing, such as the Service(Start|Stop) stages or WaitForTopology(Event|Settle)Stage, are not described here; these are more or less trivial, refer to their documentation.

Legacy and LogValues

Initially we used the approach now known as legacy logic: we spawn several threads, each issuing a put/get operation every 100 ms (see the delayBetweenRequests property in GeneralConfiguration). These threads are not terminated at the end of the stage but keep working in the background. Other threads retrieve and record statistics. To check whether the cache holds all entries, we used CheckCacheDataStage.

This approach works for checking the lag of response times, seeing how throughput drops when a node crashes, and so on. That makes the legacy logic valid for cases where we want to check performance during failover.

However, it's not a good functional test: the stressors mindlessly overwrite all the entries (raising no alarm if an entry is lost), and by the time CheckCacheDataStage runs, the entry may already have been loaded again. Even if those checks were in place, only truly missing entries would be detected; but much more can go wrong: writes that were not applied, double writes, inconsistencies and so on.

That's why the 'log values logic' was devised. It is activated by setting the property log-logic-enabled to true. It is similar to legacy logic in some ways: we start some threads and let them run in the background. Then we run other stages, restarting the nodes and whatever else we like, but every so often we insert the BackgroundStressorsCheckStage: this has very low overhead, as it just checks whether an error was detected and fails the benchmark if so.

It's important to note that the implementation of log checkers was developed and tested only on Infinispan. There can be subtle oddities, for example in transaction behaviour (ACID? LOL!), that were already fixed for Infinispan but could cause false positives for other caches/grids.

The principle of log logic is quite simple: we don't use opaque fixed-size values as in legacy mode (where performance is the top priority); instead, the values are append-only lists. There are two types of threads: stressor threads, which access entries in a deterministic sequence and during each operation append an operationId to the value (reading the value first, appending the operationId locally and writing the entry back); and checker threads, which know the stressors' sequences and follow them, checking that each operation was correctly appended.
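The read-append-write cycle can be sketched as follows. This is a minimal illustration with a plain map standing in for the cache; the actual RadarGun classes differ in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the log-value append cycle; names are illustrative,
// not RadarGun's actual API.
public class LogValueAppend {
    // Stand-in for the cache: key -> append-only list of operationIds.
    static final Map<String, List<Long>> cache = new ConcurrentHashMap<>();

    // Read the current value, append the operationId locally, write it back.
    static List<Long> append(String key, long operationId) {
        List<Long> old = cache.get(key);
        List<Long> updated = old == null ? new ArrayList<>() : new ArrayList<>(old);
        updated.add(operationId);
        cache.put(key, updated);
        return updated;
    }

    public static void main(String[] args) {
        append("key_0000000000001234", 1);
        append("key_0000000000001234", 2);
        System.out.println(cache.get("key_0000000000001234")); // [1, 2]
    }
}
```

A checker following this stressor would load the same key and verify that each expected operationId appears in the list, in order.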

Above I wrote that we always append to the value. How do we test removes, then? In fact, for each logical key there are two entries: one under key 'key_0000000000001234' and one under 'key_FFFFFFFFFFFFEDCB' (note that the second id is the bitwise complement of the first). When we want to execute a remove, we append the operationId to the value and write it under the complementary key. Then we can safely remove the value under the primary key.
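The pairing of primary and complementary keys can be derived mechanically from the 64-bit key id; a sketch (with an assumed key naming scheme matching the example above):

```java
// Sketch: the complementary key is the bitwise complement of the
// 64-bit key id, formatted as 16 hex digits. Names are illustrative.
public class ComplementaryKey {
    static String primaryKey(long keyId) {
        return String.format("key_%016X", keyId);
    }

    static String complementaryKey(long keyId) {
        return String.format("key_%016X", ~keyId);
    }

    public static void main(String[] args) {
        System.out.println(primaryKey(0x1234L));       // key_0000000000001234
        System.out.println(complementaryKey(0x1234L)); // key_FFFFFFFFFFFFEDCB
    }
}
```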

The stressors don't keep any data about their entries in memory except the sequence seed and the current operationId. When they want to execute an operation on a given key, they have to load the entry first. Therefore, if a remove was not executed, we would read the old value, continue without the operationId that 'removed' the entry, and the checkers would soon detect that.

By default there are 10 stressor threads on each node. The checkers on one node have to check all stressors from all nodes in order to guarantee that each node has a correct view of the data. However, we don't keep a dedicated checker for every stressor on each node: instead, just the expected state of those stressors is kept, and a pool of checkers works through them.

The checkers must know what the last operation of each stressor was (otherwise, if a checker were faster, it would report missing operations simply because the stressor has not managed to execute them yet). All communication happens through the cache: each stressor regularly writes its last operationId and the seed of its deterministic sequence under a special key. This entry is also read by the stressor after the node where it resides is restarted, so that it knows where to begin from.

If we executed the test for long enough, the log values would grow without bounds. That's why each checker stores its last checked operation in the cache: if a stressor finds that an entry has grown too big, it finds the highest operationId that has already been checked on all nodes and truncates operationIds up to that value. The checkers also use their last checked record when a node is restarted.
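The truncation rule can be illustrated with a small sketch: an operationId is safe to drop only once every node's checker has confirmed it, i.e. up to the minimum of the per-node last-checked ids (illustrative code, not RadarGun's implementation):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of log truncation: keep only operationIds above the highest id
// confirmed as checked on all nodes.
public class LogTruncation {
    // The highest id checked *everywhere* is the minimum of the
    // per-node last-checked ids.
    static long safeTruncationPoint(List<Long> lastCheckedPerNode) {
        return lastCheckedPerNode.stream().mapToLong(Long::longValue).min().orElse(-1);
    }

    static List<Long> truncate(List<Long> operationIds, long checkedEverywhere) {
        return operationIds.stream()
                .filter(id -> id > checkedEverywhere)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long point = safeTruncationPoint(List.of(7L, 5L, 9L)); // 5
        System.out.println(truncate(List.of(3L, 5L, 6L, 7L), point)); // [6, 7]
    }
}
```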

There's one more option, ignoreDeadCheckers, that can be useful for elasticity tests: if the cluster shrinks to a single node, the stressors on that node soon cannot progress, as local writes are fast, the checkers on the dead nodes cannot follow, and the entries quickly grow to maximum size. That's why a stressor may decide to truncate entries that were not checked, but it records this under a special key so that the checkers spawned after the other nodes start again don't panic.

If the property sharedKeys is set to false (the default), each stressor thread operates on its private set of keys. Basic operations (put, get, remove) suffice for that. However, when you set sharedKeys to true, all stressors update the same set of keys concurrently. Here we need conditional operations such as putIfAbsent, conditional replace and conditional remove. Also, the entry value (aka log value) has to keep a stressorId along with each operationId.

Private log logic

This is the simplest mode: the stressor retrieves the primary value and, if it is null, retrieves the value under the complementary key. Then it writes the new value (either overwriting the primary key or backing it up to the complementary entry) and removes the other entry.

The checker tries to retrieve the primary value and, if it is null, retrieves the complementary one. There's a chance that the stressor managed to 'move' the value between those two gets; therefore, the checker tries the process above up to 100 times to make sure the value was really lost. When it manages to retrieve some value, that value must contain the operationId (it doesn't matter whether it is the older or the newer copy).
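The retry loop can be sketched as follows (a plain map stands in for the cache; the attempt count of 100 matches the description above, the rest is illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the private-mode check: try primary, then complementary,
// retrying because the stressor may 'move' the value between the reads.
public class PrivateLogCheck {
    static final Map<String, List<Long>> cache = new ConcurrentHashMap<>();
    static final int MAX_ATTEMPTS = 100;

    static boolean containsOperation(String primary, String complementary, long operationId) {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            List<Long> value = cache.get(primary);
            if (value == null) {
                value = cache.get(complementary); // value may be backed up there
            }
            if (value != null && value.contains(operationId)) {
                return true; // older or newer copy, either is fine
            }
        }
        return false; // value really lost -> the caller reports an error
    }

    public static void main(String[] args) {
        cache.put("key_0000000000001234", List.of(1L, 2L, 3L));
        System.out.println(containsOperation("key_0000000000001234", "key_FFFFFFFFFFFFEDCB", 2L)); // true
    }
}
```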

Shared log logic

The stressor always has to retrieve both values (primary and complementary), and the new value is the union of the two with the operationId appended. That's because we cannot update the two entries atomically (without transactions), so we have to deal with the fact that concurrent operations may fail.

Then it tries to write the new value under the appropriate key; however, it has to use a conditional operation to make sure that the entry was not overwritten after the old value was retrieved. If this conditional operation fails, we have to start from the beginning (just for this operation, naturally). The removal of the other entry is also conditional, but if that one fails, we don't repeat the operation: another thread has probably moved the value again.
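A sketch of this retry loop, using ConcurrentHashMap's conditional methods as stand-ins for the cache's putIfAbsent/replace/remove (the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the shared-mode write: read both entries, union them with the
// new operationId, write conditionally, and retry from scratch on conflict.
public class SharedLogWrite {
    static final Map<String, List<Long>> cache = new ConcurrentHashMap<>();

    static void appendShared(String primary, String complementary, long operationId) {
        while (true) {
            List<Long> p = cache.get(primary);
            List<Long> c = cache.get(complementary);
            // Union of both entries plus the new operation, sorted and deduplicated.
            TreeSet<Long> union = new TreeSet<>();
            if (p != null) union.addAll(p);
            if (c != null) union.addAll(c);
            union.add(operationId);
            List<Long> updated = new ArrayList<>(union);
            boolean written = (p == null)
                    ? cache.putIfAbsent(primary, updated) == null
                    : cache.replace(primary, p, updated);
            if (written) {
                // Conditional removal of the other entry; if it fails, another
                // thread has probably moved the value again - don't retry.
                if (c != null) cache.remove(complementary, c);
                return;
            }
            // Conditional write failed: someone got there first, start over.
        }
    }

    public static void main(String[] args) {
        appendShared("key_P", "key_C", 1L);
        appendShared("key_P", "key_C", 2L);
        System.out.println(cache.get("key_P")); // [1, 2]
    }
}
```

In the real shared mode the value also carries a stressorId per operation, as noted earlier; that bookkeeping is omitted here for brevity.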

The checker has to cope with that: it tries up to 100 times to find the operationId in either the primary or the complementary entry. If it is not successful after that many attempts, an error is recorded.

Transactions

The basic difference is simple: we just execute begin() and commit() every few operations. If the commit fails, we have to start again from the beginning of the transaction, so we must be able to roll the stressor's seed back to that moment.
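The seed rollback can be sketched like this: snapshot the deterministic sequence position before begin(), and restore it when commit() fails so the same operations are replayed. The transaction interface and key derivation here are placeholders, not RadarGun's API:

```java
// Sketch of transactional stressing with rollback of the sequence state.
public class TransactionalStressor {
    interface Tx { void begin(); boolean commit(); }

    long operationId = 0;
    long sequencePosition = 0;  // index into the deterministic key sequence
    final long seed = 42;

    // Deterministic key for a given position (a simple hash-mix stand-in).
    long keyAt(long position) {
        long x = seed + position * 0x9E3779B97F4A7C15L;
        return x ^ (x >>> 31);
    }

    void runTransaction(Tx tx, int opsPerTx) {
        long savedOperationId = operationId;
        long savedPosition = sequencePosition;
        tx.begin();
        for (int i = 0; i < opsPerTx; i++) {
            long key = keyAt(sequencePosition++);
            operationId++; // would append operationId to the entry under 'key' here
        }
        if (!tx.commit()) {
            // Rollback: restore the stressor to its pre-transaction state,
            // so the deterministic sequence replays the same operations.
            operationId = savedOperationId;
            sequencePosition = savedPosition;
        }
    }

    public static void main(String[] args) {
        TransactionalStressor s = new TransactionalStressor();
        Tx failing = new Tx() { public void begin() {} public boolean commit() { return false; } };
        s.runTransaction(failing, 5);
        System.out.println(s.operationId); // 0 - rolled back
        Tx ok = new Tx() { public void begin() {} public boolean commit() { return true; } };
        s.runTransaction(ok, 5);
        System.out.println(s.operationId); // 5
    }
}
```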

However, there's one unexpected behaviour in Infinispan: reading two values that are modified inside a single transaction can (even within a transaction) return one value from before that transaction and the second from after it. Multiple writes are not atomic with respect to reads.

That means that we have to:

Clustered listeners

Clustered listeners are quite a new feature, and their support in RG has not been thoroughly tested yet. However, the implementation is quite simple: we register a clustered listener for every operation and then record those operations in the same record as the regular confirmations from the checker threads. The check for a stressor then cannot proceed until the notification is received as well as the value through a regular get.

Component overview

Failover tests - Class diagram

Checking algorithm (general)

  1. Take a record (queue head) from StressorRecordPool
  2. If the last unsuccessful timestamp is set, add a record confirmation (Last operation key) to mark the operation that we're sure is written into the cache
  3. Find the value for the given record - the get operation(s) are performed on the cache, with the key obtained via StressorRecord.getKeyId
  4. Check whether the current operation can be found in the value from step #3
    1. True - set the record timestamps (successful to the current time, unsuccessful to undefined), remove all confirmations with operationId <= record.currentOperationId, generate a new key for the record, increment the operation id
    2. False - check whether the confirmations contain an operation with operationId > currentOperation
      1. If not, check whether the operation for the given stressor should be ignored (see Ignored key)
      2. If not, record the missing operation in FailureManager
      3. Set the record timestamp (unsuccessful to the current time), remove all confirmations with operationId <= record.currentOperationId, generate a new key for the record, increment the operation id
  5. Return the record back to StressorRecordPool
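A compact sketch of one pass of this loop, with StressorRecordPool and FailureManager reduced to simple stand-ins (timestamps and confirmations are omitted for brevity; all names besides those mentioned in the steps are illustrative):

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Sketch of one iteration of the checking algorithm (steps 1, 3-5).
public class CheckerLoop {
    static class StressorRecord {
        long currentOperationId = 0;
        long keyId = 0;
        void advance() { keyId++; currentOperationId++; } // deterministic stand-in
    }

    interface Cache { List<Long> get(long keyId); }

    static int missingOperations = 0; // stand-in for FailureManager

    static void checkOnce(Queue<StressorRecord> pool, Cache cache) {
        StressorRecord record = pool.poll();        // 1. take record from the pool
        List<Long> value = cache.get(record.keyId); // 3. find value for the record's key
        if (value != null && value.contains(record.currentOperationId)) {
            record.advance();                       // 4.1 success: advance the sequence
        } else {
            missingOperations++;                    // 4.2 failure: report it
            record.advance();
        }
        pool.add(record);                           // 5. return record to the pool
    }

    public static void main(String[] args) {
        Queue<StressorRecord> pool = new ArrayDeque<>();
        pool.add(new StressorRecord());
        checkOnce(pool, keyId -> keyId == 0 ? List.of(0L) : null);
        System.out.println(missingOperations); // 0
    }
}
```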

Data written into cache during test

Checker

Stressor