V4/OTP Replay Prevention
- 1 Overview
- 2 TOTP Replay
- 3 Local Replay
- 4 Replication Counter Decrement
- 5 Replication Counter Race
Overview
When deploying a 2FA solution using OTPs, it is important to ensure that OTPs really are One-Time Passwords. This means taking steps to ensure that replay attacks cannot succeed. In an independently replicated environment such as LDAP, this is difficult and may even be logically impossible. It is thus important to mitigate this risk as much as possible while clearly defining any weaknesses in such a system.
TOTP Replay
TOTP accepts codes within a certain window around the current time (adjusted by the token time offset), so a code can be replayed for as long as its window remains valid.
We will implement a watermark system which records the last interval used. The OTP validation system will reject any token interval that is less than or equal to the watermark. Similar to the HOTP counter, this watermark will be replicated to other systems. If only a local watermark is desired (preventing replays on a single server, but not globally), replication of this attribute can be disabled.
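The watermark rule can be illustrated with a minimal sketch. This is not the actual validation code: the `Token` class, the `STEP` and `WINDOW` values, and the `expected_code` callable (standing in for the real TOTP computation) are all assumptions made for illustration.

```python
import hmac
from dataclasses import dataclass

STEP = 30    # seconds per TOTP interval (assumed value)
WINDOW = 1   # intervals accepted on either side of "now" (assumed value)


@dataclass
class Token:
    watermark: int = -1  # last interval that was successfully used
    offset: int = 0      # token clock offset, in seconds


def check_totp(token, code, now, expected_code):
    """Accept `code` only for an interval strictly above the watermark.

    `expected_code(interval)` is a hypothetical callable returning the
    valid code string for a given interval.
    """
    current = int(now + token.offset) // STEP
    for interval in range(current - WINDOW, current + WINDOW + 1):
        if interval <= token.watermark:
            continue  # this interval (or a later one) was already used
        if hmac.compare_digest(expected_code(interval), code):
            token.watermark = interval  # record the interval just consumed
            return True
    return False
```

Once an interval has been accepted, the same code is rejected even though it is still inside the time window, and so is any code for an earlier interval.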
Resolved: Ticket #4410
Local Replay
It is possible to simultaneously execute multiple authentications against a server. This may cause race conditions with the HOTP counter or (less likely) the TOTP watermark. It may also permit a rapid replay attack.
The OTP validation code should lock the token's LDAP entry during authentication. In order to avoid making it easy to cause a DoS attack on the server, if the aforementioned lock cannot be acquired within a (very) short period, the authentication request should be denied.
Resolved: Ticket #4493
Replication Counter Decrement
It is theoretically possible that an unknown, system-wide race condition might cause an HOTP counter or TOTP watermark to move backwards. This would permit tokens to be reused.
Develop a 389 DS plugin which, upon receiving a replication request for an HOTP counter or TOTP watermark, compares the local value with the replicated value. If the replicated value is less than the local value, the replication request should be discarded.
NOTE: this is a defensive measure against unknown race conditions. The race conditions should still be addressed directly when discovered.
NOTE: this plugin may provide additional benefit if using the synchronous synchronization fix mentioned below.
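The comparison the plugin performs is small enough to sketch directly. The following is an illustration of the rule only, not 389 DS plugin code; the dictionary-based entry and the attribute names `counter` and `watermark` are assumptions.

```python
def apply_replicated_value(entry, attr, incoming):
    """Apply a replicated counter/watermark value unless it moves backwards.

    `entry` is a plain dict standing in for the local token entry.
    Returns True if the value was applied, False if it was discarded.
    """
    current = entry.get(attr)
    if current is not None and incoming < current:
        return False  # replicated value is older than local state: discard
    entry[attr] = incoming
    return True


token = {"counter": 12, "watermark": 5000}
apply_replicated_value(token, "counter", 9)    # discarded, counter stays 12
apply_replicated_value(token, "counter", 15)   # applied, counter becomes 15
```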
Resolved: Ticket #4494
Replication Counter Race
Assuming all local server race conditions are prevented, there is still the system-wide replication race condition where a user can log in to two servers simultaneously. This is due to the fact that replication is asynchronous and takes an unknown period of time. During this replication period, the counter/watermark values will differ on different servers, permitting the reuse of an OTP code.
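A toy model makes the race concrete. The `Replica` class and `replicate` function below are hypothetical and only mimic asynchronous replication with an explicit outbox; they are not how the directory server replicates.

```python
class Replica:
    def __init__(self):
        self.counter = 0   # highest counter value consumed locally
        self.outbox = []   # changes not yet delivered to the peer

    def authenticate(self, presented):
        """Accept a code (modelled by its counter value) if it is new locally."""
        if presented <= self.counter:
            return False
        self.counter = presented
        self.outbox.append(presented)
        return True


def replicate(src, dst):
    """Deliver pending changes from src to dst (the asynchronous step)."""
    for value in src.outbox:
        dst.counter = max(dst.counter, value)
    src.outbox.clear()


a, b = Replica(), Replica()
first = a.authenticate(1)    # legitimate login on server A
replay = b.authenticate(1)   # same code on server B before replication
replicate(a, b)
replicate(b, a)
late = b.authenticate(1)     # after replication the replay is rejected
```

In this model both `first` and `replay` are accepted, and only `late` is rejected, which is exactly the window described above.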
This is a difficult problem to solve. Below is a list of designs which may eliminate or greatly mitigate the problem.
Priority Replication
Expand the replication system to permit some attributes to be defined as priority replication. This would ensure that when a remote system receives a priority replication request, it is processed before any other requests.
- Total system replication load is high due to frequent counter/watermark replication.
- While this would help decrease the rise in replication time during conditions of heavy load, it would not eliminate simultaneous multi-server authentication.
Synchronous Synchronization
In this fix, we do not permit authentication unless the state can be consistent across the cluster. This is done by disabling replication on the counter/watermark and performing synchronization manually using the following procedure.
When a server receives an authentication request, it attempts to verify the request locally. If this fails, the server will request the counter values for all the user's tokens from all other servers, merging the highest token values to the local store and attempting the verification again. This step ensures cluster consistency. If local verification succeeds, the server will attempt to modify the token counter/watermark value on all remote servers. If this modification fails on any server, the token has been used elsewhere and the server should fail the request. If the remote modifications are successfully applied, the state of the cluster is consistent and the authentication should succeed.
A note about unavailable nodes is necessary. If a node in the cluster is unavailable, quorum should be used to determine if authentication is successful. If quorum fails, the authentication should fail by default (strict mode). There should be an option to disable strict mode. In this case, authentication will succeed when quorum is not present with the caveat that replays may occur. This would handle the case of a split-brain.
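The procedure and the quorum rule can be sketched together. Everything here is an assumption made for illustration: the `Server` class, the look-ahead `WINDOW`, modelling an OTP code by its counter value, and treating quorum as a simple majority of the cluster (including the local server).

```python
WINDOW = 10  # assumed look-ahead window for local verification


class Server:
    def __init__(self, name):
        self.name = name
        self.counters = {}  # token id -> highest counter consumed
        self.up = True

    def read(self, token):
        if not self.up:
            raise ConnectionError(self.name)
        return self.counters.get(token, 0)

    def advance(self, token, value):
        """Move the counter forward; refuse if it would not increase."""
        if not self.up:
            raise ConnectionError(self.name)
        if value <= self.counters.get(token, 0):
            return False
        self.counters[token] = value
        return True


def verify_local(server, token, presented):
    base = server.counters.get(token, 0)
    return base < presented <= base + WINDOW


def authenticate(local, peers, token, presented, strict=True):
    # Step 1: verify locally; on failure, merge the highest remote value
    # into the local store and retry once.
    if not verify_local(local, token, presented):
        for peer in peers:
            try:
                remote = peer.read(token)
            except ConnectionError:
                continue
            local.counters[token] = max(local.counters.get(token, 0), remote)
        if not verify_local(local, token, presented):
            return False
    # Step 2: push the new value to every reachable peer. A refusal means
    # the token was already used elsewhere.
    reachable = 1  # the local server counts towards quorum
    for peer in peers:
        try:
            if not peer.advance(token, presented):
                return False
        except ConnectionError:
            continue
        reachable += 1
    # Step 3: without a majority, fail in strict mode.
    if strict and 2 * reachable <= len(peers) + 1:
        return False
    local.counters[token] = presented
    return True
```

With three servers, a code accepted through one server is refused by the others, a server with stale state catches up during its next authentication, and a server that can reach no peers denies the request unless strict mode is disabled.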
Because of the properties of this system, when a server has invalid state, it will gradually come into consistency with the state of the cluster as authentications occur. No replication is needed.
One alternative to this design is to limit the authentication servers to a subset of the total cluster. In a cluster with a large number of replicas, this will increase performance.
- Total system replication load is zero, since counters/watermarks are not replicated.
- Longer authentication times.
Dynamic Master Proxying
Add an attribute to each user defining the user's master server. If a bind request is received on the server which is the user's master server, all authentication proceeds as normal. Otherwise, the bind request is proxied to the user's master server. If the master server is unavailable, the slave servers can either reject the request or process it locally based on policy; in the latter case, replays will be possible (within the replication window).
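The routing decision for a bind request can be sketched as a small function. The function name, its parameters, and the returned labels are hypothetical; the sketch covers only the static case where the master attribute is already set.

```python
def route_bind(this_server, master, master_reachable, process_locally_on_failure):
    """Decide how a bind request for a user should be handled.

    Returns a (decision, target) tuple, where decision is one of
    "local", "proxy", "local-degraded", or "reject".
    """
    if master == this_server:
        return ("local", None)           # we are the user's master
    if master_reachable:
        return ("proxy", master)         # forward the bind to the master
    if process_locally_on_failure:
        return ("local-degraded", None)  # policy allows it; replays possible
    return ("reject", None)              # policy forbids local processing


route_bind("ipa1", "ipa1", True, False)   # handled locally
route_bind("ipa2", "ipa1", True, False)   # proxied to ipa1
route_bind("ipa2", "ipa1", False, True)   # processed locally, degraded
```

Keeping the degraded case as a distinct outcome makes it easy to log or audit the authentications that were processed while replays were possible.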
The above design presumes that the master server is manually set and static. In some deployments, this may be desirable. However, based on some currently undefined policy, it should also be possible to transfer the master automatically.
First, using a collection of local statistics, the master can be transferred to the slave the user uses most often. In this case, the master is still available but is less popular than one of the slaves. The slave can request to become the master simply by asking the master to change the attribute's value to itself and, if successful, updating the local attribute to the same value (to avoid replication delay). Other servers would receive this change via replication. Requests to the old master during the replication window would be automatically proxied to the new master. Optionally, the response to the master change request could be accompanied by a control containing the counters/watermarks for all the user's tokens. This would allow the master/slave switch-over to avoid any replays while pending replications have not yet arrived.
Second, if the master were to disappear from the cluster or if the attribute is undefined, the slave can usurp master status by changing the user's master server attribute and completing the request locally. If other slaves simultaneously receive an authentication request for the same user, they will also attempt to usurp master status. Replication will eventually resolve this. But until that point, replay attacks will be possible. Combining this approach with priority replication could reduce this period.
Optionally, since authentications are only permitted on a single master, we could disable replication on the watermark (but not the counter), eliminating a large amount of replication traffic. This would allow only a very small number of replays (currently 7 at max), for only a limited time, and only during a master/slave switch-over. If using the optional master-change control defined above, replays would only be possible during usurpation (where they are always possible).
- Total system replication load is medium to high due to frequent counter (and optionally watermark) replication.
- Cross-server replay attacks are possible during the limited usurpation window.
- Cross-server replay attacks may be possible during a change-over request (without optional control).
- Implementation is complex.
Document the Issue
We simply document that, while difficult to exploit due to comprehensive encryption, replay attacks are possible across the system.
Optionally, we could reduce replication load by not replicating the TOTP watermark. This would provide only local server TOTP replay prevention. HOTP counters must still be replicated.
- Total system replication load is medium due to frequent counter replication.
- Cross-server replay attacks are possible within the replication window.
This issue is recorded in Ticket #4443.
Due to the difficulty of solving this problem, we have elected to simply document the problem for 4.1. The best solution appears to be synchronous synchronization. We expect to implement this method for some future release.