Backup and Restore

What is Backup and Restore?

There is often confusion about what backup and restore procedures are meant to solve. On the surface it sounds simple: back up the data and set it aside, then, when something goes wrong, copy the saved data back. What could be simpler? However, once multi-instance deployments and different versions or configurations are factored in, backup and restoration becomes a real challenge.

Let us step back and ask why one would need a backup and restore procedure or, in other words, a 'disaster recovery' procedure. The answer is: to recover from a disaster! Let us see what kinds of disasters we want to recover from and how capabilities that already exist in FreeIPA can be used to handle these scenarios.

There are two classes of disaster scenarios:

  • Server Loss: The FreeIPA deployment loses one, several, or all servers due to a disaster (fire, earthquake, hardware malfunction, etc.) and needs to get back online as soon as possible.
  • Data Loss: FreeIPA data was accidentally deleted, either by a user or by a software bug, and the deletion was propagated to all servers, given that FreeIPA is a multi-master solution.

Server Loss Cases

Our usual recommendation for redundancy is to run several (2-3) FreeIPA Servers in each data center of a deployment and let them replicate with each other. This way, when one server is lost, it can be replaced by simply creating a new FreeIPA Server (replica), and the deployment quickly returns to a fully functional state.

However, FreeIPA Servers are not created equal and a lot depends on the configuration. The first server ever installed has special properties that make it slightly different from the others. We will call it the first master. The first master is no different from other masters in the deployment except for certificate management. If the first master was installed with a full root or chained CA, it is the server that:

  • Tracks and renews internal certificates: all other servers just get a copy of the renewed certs
  • Publishes CRLs: all other masters do not do that by default but can be configured

Other masters can be deployed with a full CA like the first master, or without a CA as a more lightweight replica. Since FreeIPA 3.2 it is also possible to install the whole deployment without a CA (CA-less installation). In that case all FreeIPA Servers are equal and there is no distinction between them.

One Server Loss

The one server loss scenarios cover cases where, for example, a hardware failure occurs and a server in the deployment needs to be removed and potentially replaced. Note that in the following procedures, replacing a server always means following a remove/cleanup procedure and then installing a new server on the same or different hardware, or provisioning a new VM with the same or a different identity.

One Server Loss - First Master

If the first master is lost, another master with a full CA (if it is not a CA-less deployment) needs to be nominated as the new first master following a manual procedure:

  1. Clean the deployment of the lost server by removing all replication agreements with it.
  2. Choose another FreeIPA Server with a CA installed to become the first master.
  3. Nominate this master to be the one in charge of renewing certs and publishing CRLs. This is a manual procedure at the moment.
  4. Follow the standard installation procedure to deploy a new master on hardware or a VM of your choice.

Changing the topology of a deployment has an impact on the clients. To mitigate this impact we recommend relying on DNS discovery and letting clients adapt automatically to topology changes. If FreeIPA DNS is used, the topology changes are reflected automatically. If DNS is managed manually, the DNS SRV records need to be updated to reflect the new FreeIPA topology. If FreeIPA clients were configured to connect explicitly to specific servers, their configuration also needs to reflect the new FreeIPA Server hostnames.
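For deployments where DNS is managed manually, the following is a minimal sketch of how the SRV records for the surviving and rebuilt servers could be generated. The service labels and ports are the usual LDAP, Kerberos and kpasswd ones; the domain, host names, TTL, priority and weight are placeholders chosen for illustration and should be checked against the records your zone already contains.

```python
# Sketch: generate DNS SRV records for a manually managed zone.
# Domain and host names passed in are placeholders, not values from this page.

# (service label, protocol, port) for the services clients discover via DNS.
SRV_SERVICES = [
    ("_ldap", "tcp", 389),
    ("_kerberos", "tcp", 88),
    ("_kerberos", "udp", 88),
    ("_kerberos-master", "tcp", 88),
    ("_kerberos-master", "udp", 88),
    ("_kpasswd", "tcp", 464),
    ("_kpasswd", "udp", 464),
]


def srv_records(domain, servers, ttl=86400, priority=0, weight=100):
    """Return zone-file lines advertising every server for every service."""
    lines = []
    for service, proto, port in SRV_SERVICES:
        for server in servers:
            lines.append(
                f"{service}._{proto}.{domain}. {ttl} IN SRV "
                f"{priority} {weight} {port} {server}."
            )
    return lines


if __name__ == "__main__":
    # Hypothetical topology after the lost server has been replaced.
    for line in srv_records("example.com", ["ipa2.example.com", "ipa3.example.com"]):
        print(line)
```

Regenerating the full record set from the list of live servers, rather than editing single records by hand, makes it harder to leave a record behind that still points at the lost server.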

Not all servers might have DNS service configured. But since DNS data is replicated across all replicas, the DNS service can be installed at any moment after finishing the procedure above.

One Server Loss - Any other server

When the first master is still running, the procedure to recover a lost FreeIPA Server is more straightforward:

  1. Clean the deployment of the lost server by removing all replication agreements with it.
  2. Follow the standard installation procedure to deploy a new master on hardware or a VM of your choice.

Not all servers might have DNS service configured. But since DNS data is replicated across all replicas, the DNS service can be installed at any moment after finishing the procedure above.
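As an illustration of the cleanup step, the sketch below only builds and prints the commands an administrator would review and then run on a surviving server; it does not execute anything. It assumes the ipa-replica-manage tool is the one used to manage replication agreements in your FreeIPA version and that --force is wanted because the lost server can no longer be contacted. The host name is a placeholder; check the man page of your version before running anything.

```python
# Sketch: build (but do not run) the cleanup commands for lost servers.
# Assumption: replication agreements are managed with ipa-replica-manage
# in the deployed FreeIPA version.

def cleanup_commands(lost_servers):
    """Return argv lists: list current masters, then delete each lost server."""
    commands = [["ipa-replica-manage", "list"]]
    for host in lost_servers:
        # --force: the lost server is unreachable, so its side cannot be cleaned.
        commands.append(["ipa-replica-manage", "del", host, "--force"])
    return commands


if __name__ == "__main__":
    # Hypothetical lost server name.
    for argv in cleanup_commands(["ipa3.example.com"]):
        print(" ".join(argv))
```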

Several Server Loss

If several servers are lost at the same time, one needs to determine whether the deployment can be rebuilt from what is left.

First Master is still alive

If the first master is still alive, the One Server Loss - Any other server procedure can be followed to rebuild every lost server and restore the environment.

First Master is lost

If the installation included a CA and at least one master with a CA is available, first follow the One Server Loss - First Master procedure to establish the new first master, and then follow the One Server Loss - Any other server procedure for every lost server to rebuild the environment.

If there is no master with a CA left, the deployment has effectively lost the ability to rebuild itself from what is left. Therefore, this case needs to be treated as a total loss scenario.

This is a CA-less deployment

In a CA-less deployment all masters are equal and the One Server Loss - Any other server procedure can be followed to rebuild the environment.
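The choice between these cases can be summarized as a small decision function. This is only a sketch of the reasoning above; the Deployment fields and the function name are made up for illustration, and the returned strings are the procedure names used on this page.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    ca_less: bool              # True for a CA-less installation
    first_master_alive: bool   # is the first master among the survivors?
    surviving_ca_masters: int  # surviving servers that have a CA installed
    surviving_servers: int     # all surviving servers


def recovery_plan(d):
    """Return the ordered list of procedures to follow for this deployment."""
    rebuild = "One Server Loss - Any other server (repeat for every lost server)"
    promote = "One Server Loss - First Master"
    total_loss = "Total Infrastructure Loss"

    if d.surviving_servers == 0:
        return [total_loss]
    if d.ca_less or d.first_master_alive:
        # All masters are equal, or the special one survived.
        return [rebuild]
    if d.surviving_ca_masters > 0:
        # Establish a new first master, then rebuild the rest.
        return [promote, rebuild]
    # No CA left: the deployment cannot rebuild itself.
    return [total_loss]
```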

Total Infrastructure Loss

This is the case when all the servers in a deployment are lost, or what is left is not sufficient to rebuild the deployment. For that specific case we suggest running one of your replicas (with a full CA if it is not a CA-less install) in a VM, then periodically stopping this VM, taking a full snapshot of it, and bringing it back up again.

Why snapshot and not backup and restore scripts?

Let us step back and reflect a bit on the choice between snapshots and backup scripts. A snapshot saves everything in a consistent state. A backup is supposed to do that too. However, a backup differs in several major ways:

  • A backup stores only the data and not the software itself, while a snapshot saves both the data and the software
  • A backup selects what to save item by item, based on code written by a developer, while a snapshot takes everything

As one can see, with a backup script there is much more room for human error. The restored data might not match what the software expects, because the software was updated between the backup and the restore. Of course, special checks can be added, but this means more code, more complexity, and more risk of making a mistake.

The selectiveness of the backup is yet another concern. The software evolves, and new features may affect the data, so that a backup might not pick up everything or a restore might overwrite something. Since there is no way to predict the state of the data at the time the restore is run, there is a higher risk of introducing problems. This is especially troublesome in situations where coming back online as soon as possible is crucial. We feel that using less risky procedures enables FreeIPA users to recover faster. This is the main reason why the FreeIPA team was reluctant to build custom backup and restore scripts.

Backup and restore scripts

FreeIPA 3.2 introduced experimental backup and restore scripts. See the man pages for ipa-backup and ipa-restore for instructions on how to back up and restore the FreeIPA software and/or the database. Feedback from real user deployments is essential for deciding whether the scripts should be further developed by the FreeIPA team.

Recovering from a snapshot

Nothing left other than the snapshot

Boot the snapshot VM and follow the Several Server Loss procedure for the case where a master with a full CA is available (see First Master is lost). When the procedure is completed, the deployment is restored and functional, but the VM is now the first master. We recommend nominating another server and updating it to be the first master.

Something is left other than the snapshot

This is the situation when remnants of the old infrastructure are not sufficient for a full rebuild (for example, they lack a FreeIPA Server with a CA configured), but they are still functional and have the database intact, so that clients can authenticate while the environment is being rebuilt.

In such a situation we recommend the following procedure:

  1. Clean the remaining FreeIPA Servers of replication agreements with the lost servers. The goal is to have a set of synchronized remaining FreeIPA Servers with functional replication agreements between each other. A replication agreement with the snapshot VM can be left intact, as it will be used later to synchronize the snapshot with the remaining FreeIPA infrastructure.
  2. Boot the selected snapshot and start the restored FreeIPA Server.
  3. Check whether the FreeIPA Server running from the snapshot has a replication agreement with one of the surviving FreeIPA Servers. If not, connect the FreeIPA Server running from the snapshot to one of the surviving servers so that it can replicate data.
  4. Check /var/log/dirsrv/slapd-YOUR-INSTANCE/errors to see whether the FreeIPA Server running from the snapshot synchronizes correctly with the remaining FreeIPA Servers and whether it has received the fresh data. If replication fails because the database is too old, the database can be reinitialized from a running FreeIPA Server.
  5. If the database is correctly synchronized, install any required additional FreeIPA Servers to fully restore the FreeIPA infrastructure.
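For the log check in step 4, a small filter can help to pick the replication messages out of a busy errors log. The sketch below assumes that the Directory Server tags messages from its replication plug-in with the string NSMMReplicationPlugin; treat that marker as an assumption and adjust it if your log looks different. The function names are made up for illustration.

```python
# Sketch: show only the replication-related lines of a Directory Server log.
# Assumption: replication messages contain the marker "NSMMReplicationPlugin".

def replication_lines(log_lines, marker="NSMMReplicationPlugin", last=None):
    """Return lines containing the marker; keep only the newest `last` if set."""
    matching = [line.rstrip("\n") for line in log_lines if marker in line]
    return matching[-last:] if last else matching


def read_replication_lines(path, last=20):
    """Read a log file and return its most recent replication messages."""
    with open(path, errors="replace") as log:
        return replication_lines(log, last=last)


if __name__ == "__main__":
    import sys

    # Pass the errors log path of your instance as the only argument.
    for line in read_replication_lines(sys.argv[1]):
        print(line)
```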

If the snapshot is too old and its state is so inconsistent with the state of the remaining FreeIPA Servers that its database can be neither synchronized nor reinitialized, a different procedure needs to be applied:

  1. Remove any replication agreement of the remaining FreeIPA Servers with the FreeIPA Server that will be restored from a snapshot. This step prevents replication of inconsistent data to the restored FreeIPA Server.
  2. Boot the selected snapshot and start the restored FreeIPA Server. Install a sufficient number of FreeIPA replicas from the FreeIPA Server running from the snapshot to handle the load of the deployment. When this step is finished, there will be 2 disconnected FreeIPA deployments.
  3. Switch clients to use the restored FreeIPA Servers
  4. Stop and uninstall FreeIPA Servers of the old infrastructure
  5. Install any required additional FreeIPA Servers to fully restore the FreeIPA infrastructure

In this case, the old and the new FreeIPA Servers should run in parallel only for the limited time needed to build a sufficient restored FreeIPA infrastructure, in order to limit data inconsistencies between the two disconnected FreeIPA realms.

Recovering FreeIPA clients

While FreeIPA Servers are restored, FreeIPA clients may need changes as well:

  • FreeIPA Server hostnames: if a client is configured with a hardcoded FreeIPA Server hostname and the hostname was changed (e.g. during the restoration process), its configuration needs to be updated to reflect the new hostnames. At minimum, /etc/sssd/sssd.conf and /etc/krb5.conf should be updated. The situation is easier if the FreeIPA clients are configured to use FreeIPA Server autodiscovery via DNS SRV records. Then only the DNS SRV records need to be updated to let the FreeIPA clients properly resolve the servers.
  • Stale data: especially in Total Infrastructure Loss cases, it may make sense to remove any stale data on the client and purge the SSSD cache. This step is optional, but it should be considered if users experience inconsistent login behavior on FreeIPA clients.
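For clients with hardcoded server names, the hostname change can be applied to the text of a configuration file with a helper like the one below. This is a minimal sketch: the function name is made up, the host names in the usage example are placeholders, and the helper only rewrites text. Reading and writing /etc/sssd/sssd.conf and /etc/krb5.conf, keeping a copy of the originals, and restarting SSSD are left to the administrator.

```python
import re


def replace_server_hostname(config_text, old_host, new_host):
    """Replace whole-hostname occurrences of old_host with new_host.

    A match must not be preceded or followed by a letter, digit, dot or
    hyphen, so that longer names containing old_host are left untouched.
    """
    pattern = r"(?<![\w.-])" + re.escape(old_host) + r"(?![\w.-])"
    return re.sub(pattern, lambda match: new_host, config_text)


if __name__ == "__main__":
    # Hypothetical fragment of an sssd.conf with a hardcoded server.
    fragment = "ipa_server = ipa1.example.com\n"
    print(replace_server_hostname(fragment, "ipa1.example.com", "ipa9.example.com"), end="")
```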

Data Loss Cases

Up to this point, we have discussed only server losses, not data losses. Data loss cases are more difficult to recover from, and there is no generic procedure that can be reused across all FreeIPA deployments. The procedure depends on the data that was lost: is it a group? A set of users? Host data? Let us provide some guidelines on how to recover from this situation in the best way.

As soon as it is determined that there is a data loss situation, actions should be taken to try to stop the data loss proliferation in the infrastructure. Affected replicas should be isolated and brought down as soon as possible to prevent spread of the data loss.

If the data loss is stopped and part of the infrastructure is still unaffected, the situation can be treated as if all corrupted servers were lost and only the unaffected ones remain. The recovery procedures described in Server Loss Cases can then be used to rebuild the infrastructure. Affected replicas should not be brought up again until they are re-deployed, to avoid spreading the corrupted data again.

If the data loss affected all servers there are 2 options:

  • Start over from a snapshot as described in Something is left other than the snapshot, and make sure there is no replication between old and new replicas. This is a pretty drastic approach and should be used when most of the data is lost and the deployment is completely dysfunctional anyway.
  • Re-add lost data. Use this method if the scope of the data loss can be identified. This procedure expects that either a VM snapshot of a FreeIPA Server taken before the data loss exists and can be used to retrieve a snapshot of the database (LDIF), or that the FreeIPA Server database was backed up, either with the standard Directory Server tools (db2ldif) or with the FreeIPA backup command (ipa-backup --data --online) if it is available in the deployed version of FreeIPA. Once the database snapshot (LDIF) is retrieved, the lost LDAP entries can be extracted from it and added to the database with the standard ldapadd command. The restored entries will then be automatically propagated to other masters by standard replication.
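To illustrate the last option, the sketch below extracts selected entries from an LDIF export so that they can be written to a file and passed to ldapadd. It is a simplified reader written for this example, not a full LDIF parser: it unfolds continuation lines, skips comments, and ignores entries whose DN is base64 encoded (dn::). The function names and the DNs in the usage example are placeholders. Entries taken from a database export may also carry operational attributes that have to be removed before the server accepts them; that cleanup is not shown here.

```python
# Sketch: pull lost entries out of an LDIF export for re-adding with ldapadd.

def parse_ldif(text):
    """Split LDIF text into {dn: entry_text}, unfolding continuation lines."""
    unfolded = []
    for line in text.splitlines():
        if line.startswith(" ") and unfolded:
            # A leading space marks the continuation of the previous line.
            unfolded[-1] += line[1:]
        else:
            unfolded.append(line)

    entries, current = {}, []
    for line in unfolded + [""]:  # trailing "" flushes the last entry
        if line.strip() == "":
            dn_line = next((l for l in current if l.lower().startswith("dn:")), None)
            if dn_line and not dn_line.lower().startswith("dn::"):
                entries[dn_line[3:].strip()] = "\n".join(current)
            current = []
        elif not line.startswith("#"):
            current.append(line)
    return entries


def select_entries(entries, lost_dns):
    """Return LDIF text with only the entries whose DN is in lost_dns."""
    wanted = {dn.lower() for dn in lost_dns}
    chosen = [text for dn, text in entries.items() if dn.lower() in wanted]
    return "\n\n".join(chosen) + "\n" if chosen else ""


if __name__ == "__main__":
    import sys

    # Usage: script.py EXPORT.ldif "DN of lost entry" ["another DN" ...]
    with open(sys.argv[1]) as export:
        print(select_entries(parse_ldif(export.read()), sys.argv[2:]), end="")
```

Selecting entries by exact DN keeps the restore narrowly scoped, which matters here because everything that is added will be replicated to all other masters.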