src/docs/user/cluster/cluster_repositories.diviner
@title Cluster: Repositories | @title Cluster: Repositories | ||||
@group cluster | @group cluster | ||||
Configuring Phabricator to use multiple repository hosts. | Configuring Phorge to use multiple repository hosts. | ||||
Overview | Overview | ||||
======== | ======== | ||||
If you use Git, you can deploy Phabricator with multiple repository hosts, | If you use Git, you can deploy Phorge with multiple repository hosts, | ||||
configured so that each host is readable and writable. The advantages of doing | configured so that each host is readable and writable. The advantages of doing | ||||
this are: | this are: | ||||
- you can completely survive the loss of repository hosts; | - you can completely survive the loss of repository hosts; | ||||
- reads and writes can scale across multiple machines; and | - reads and writes can scale across multiple machines; and | ||||
- read and write performance across multiple geographic regions may improve. | - read and write performance across multiple geographic regions may improve. | ||||
This configuration is complex, and many installs do not need to pursue it. | This configuration is complex, and many installs do not need to pursue it. | ||||
This configuration is not currently supported with Subversion or Mercurial. | This configuration is not currently supported with Subversion or Mercurial. | ||||
How Reads and Writes Work | How Reads and Writes Work | ||||
========================= | ========================= | ||||
Phabricator repository replicas are multi-master: every node is readable and | Phorge repository replicas are multi-master: every node is readable and | ||||
writable, and a cluster of nodes can (almost always) survive the loss of any | writable, and a cluster of nodes can (almost always) survive the loss of any | ||||
arbitrary subset of nodes so long as at least one node is still alive. | arbitrary subset of nodes so long as at least one node is still alive. | ||||
Phabricator maintains an internal version for each repository, and increments | Phorge maintains an internal version for each repository, and increments | ||||
it when the repository is mutated. | it when the repository is mutated. | ||||
Before responding to a read, replicas make sure their version of the repository | Before responding to a read, replicas make sure their version of the repository | ||||
is up to date (no node in the cluster has a newer version of the repository). | is up to date (no node in the cluster has a newer version of the repository). | ||||
If it isn't, they block the read until they can complete a fetch. | If it isn't, they block the read until they can complete a fetch. | ||||
Before responding to a write, replicas obtain a global lock, perform the same | Before responding to a write, replicas obtain a global lock, perform the same | ||||
version check and fetch if necessary, then allow the write to continue. | version check and fetch if necessary, then allow the write to continue. | ||||
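The version check that gates reads and writes can be sketched as a small shell function. This is only an illustration of the comparison, not Phorge's actual implementation: the function name `needs_fetch` and its arguments are hypothetical, and the global write lock is left out.

```shell
# Hypothetical sketch of the version check; not Phorge's real code.
# Usage: needs_fetch <local_version> [<peer_version> ...]
# Prints "yes" if any peer reports a newer version, otherwise "no".
needs_fetch() {
  local_version=$1
  shift
  for peer_version in "$@"; do
    if [ "$peer_version" -gt "$local_version" ]; then
      echo yes
      return 0
    fi
  done
  echo no
}

needs_fetch 4 4 5 3   # a peer is at version 5: prints "yes"
needs_fetch 5 4 5 3   # no peer is newer: prints "no"
```

A replica that prints "yes" here would block the request until its fetch completes.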
Other mitigations are possible, but securing a network against the NSA and | Other mitigations are possible, but securing a network against the NSA and | ||||
similar agents of other rogue nations is beyond the scope of this document. | similar agents of other rogue nations is beyond the scope of this document. | ||||
Repository Hosts | Repository Hosts | ||||
================ | ================ | ||||
Repository hosts must run a complete, fully configured copy of Phabricator, | Repository hosts must run a complete, fully configured copy of Phorge, | ||||
including a webserver. They must also run a properly configured `sshd`. | including a webserver. They must also run a properly configured `sshd`. | ||||
If you are converting existing hosts into cluster hosts, you may need to | If you are converting existing hosts into cluster hosts, you may need to | ||||
revisit @{article:Diffusion User Guide: Repository Hosting} and make sure | revisit @{article:Diffusion User Guide: Repository Hosting} and make sure | ||||
the system user accounts have all the necessary `sudo` permissions. In | the system user accounts have all the necessary `sudo` permissions. In | ||||
particular, cluster devices need `sudo` access to `ssh` so they can read | particular, cluster devices need `sudo` access to `ssh` so they can read | ||||
device keys. | device keys. | ||||
Edit Policies > | Edit Policies > | ||||
Can Manage Cluster Services } | Can Manage Cluster Services } | ||||
Once the hosts are registered as devices, you can create a new service in | Once the hosts are registered as devices, you can create a new service in | ||||
Almanac: | Almanac: | ||||
- First, register at least one device according to the device clustering | - First, register at least one device according to the device clustering | ||||
instructions. | instructions. | ||||
- Create a new service of type **Phabricator Cluster: Repository** in | - Create a new service of type **Phorge Cluster: Repository** in | ||||
Almanac. | Almanac. | ||||
- Bind this service to all the interfaces on the device or devices. | - Bind this service to all the interfaces on the device or devices. | ||||
- For each binding, add a `protocol` key with one of these values: | - For each binding, add a `protocol` key with one of these values: | ||||
`ssh`, `http`, `https`. | `ssh`, `http`, `https`. | ||||
For example, a service might look like this: | For example, a service might look like this: | ||||
- Service: `repos001.mycompany.net` | - Service: `repos001.mycompany.net` | ||||
``` | ``` | ||||
To migrate a repository back off a service, use this command: | To migrate a repository back off a service, use this command: | ||||
``` | ``` | ||||
$ ./bin/repository clusterize <repository> --remove-service | $ ./bin/repository clusterize <repository> --remove-service | ||||
``` | ``` | ||||
This command only changes how Phabricator connects to the repository; it does | This command only changes how Phorge connects to the repository; it does | ||||
not move any data or make any complex structural changes. | not move any data or make any complex structural changes. | ||||
When Phabricator needs information about a non-clustered repository, it just | When Phorge needs information about a non-clustered repository, it just | ||||
runs a command like `git log` directly on disk. When Phabricator needs | runs a command like `git log` directly on disk. When Phorge needs | ||||
information about a clustered repository, it instead makes a service call to | information about a clustered repository, it instead makes a service call to | ||||
another server, asking that server to run `git log` instead. | another server, asking that server to run `git log` instead. | ||||
In a single-host cluster the server will make this service call to itself, so | In a single-host cluster the server will make this service call to itself, so | ||||
nothing will really change. But this //is// an effective test for most | nothing will really change. But this //is// an effective test for most | ||||
possible configuration mistakes. | possible configuration mistakes. | ||||
If your canary repository works well, you can migrate the rest of your | If your canary repository works well, you can migrate the rest of your | ||||
To expand an existing cluster, follow these general steps: | To expand an existing cluster, follow these general steps: | ||||
- Register new devices in Almanac. | - Register new devices in Almanac. | ||||
- Add bindings for the new devices to the repository service, also in Almanac. | - Add bindings for the new devices to the repository service, also in Almanac. | ||||
- Start the daemons on the new devices. | - Start the daemons on the new devices. | ||||
For instructions on configuring and registering devices, see | For instructions on configuring and registering devices, see | ||||
@{article:Cluster: Devices}. | @{article:Cluster: Devices}. | ||||
As soon as you add active bindings to a service, Phabricator will begin | As soon as you add active bindings to a service, Phorge will begin | ||||
synchronizing repositories and sending traffic to the new device. You do not | synchronizing repositories and sending traffic to the new device. You do not | ||||
need to copy any repository data to the device: Phabricator will automatically | need to copy any repository data to the device: Phorge will automatically | ||||
synchronize it. | synchronize it. | ||||
If you have a large amount of repository data, you may want to help this | If you have a large amount of repository data, you may want to help this | ||||
process along by copying the repository directory from an existing cluster | process along by copying the repository directory from an existing cluster | ||||
device before bringing the new host online. This is optional, but can reduce | device before bringing the new host online. This is optional, but can reduce | ||||
the amount of time required to fully synchronize the cluster. | the amount of time required to fully synchronize the cluster. | ||||
You do not need to synchronize the most up-to-date data or stop writes during | You do not need to synchronize the most up-to-date data or stop writes during | ||||
You can get a more detailed view of the current status of a specific repository on | You can get a more detailed view of the current status of a specific repository on | ||||
cluster devices in {nav Diffusion > (Repository) > Manage Repository > Cluster | cluster devices in {nav Diffusion > (Repository) > Manage Repository > Cluster | ||||
Configuration}. | Configuration}. | ||||
This screen shows all the configured devices which are hosting the repository | This screen shows all the configured devices which are hosting the repository | ||||
and the available version on each device. | and the available version on each device. | ||||
**Version**: When a repository is mutated by a push, Phabricator increases | **Version**: When a repository is mutated by a push, Phorge increases | ||||
an internal version number for the repository. This column shows which version | an internal version number for the repository. This column shows which version | ||||
is on disk on the corresponding device. | is on disk on the corresponding device. | ||||
After a change is pushed, the device which received the change will have a | After a change is pushed, the device which received the change will have a | ||||
larger version number than the other devices. The change should be passively | larger version number than the other devices. The change should be passively | ||||
replicated to the remaining devices after a brief period of time, although this | replicated to the remaining devices after a brief period of time, although this | ||||
can take a while if the change was large or the network connection between | can take a while if the change was large or the network connection between | ||||
devices is slow or unreliable. | devices is slow or unreliable. | ||||
There are three major cluster failure modes: | There are three major cluster failure modes: | ||||
- **Write Interruptions**: A write started but did not complete, leaving | - **Write Interruptions**: A write started but did not complete, leaving | ||||
the disk state and cluster state out of sync. | the disk state and cluster state out of sync. | ||||
- **Loss of Leaders**: None of the devices with the most up-to-date data | - **Loss of Leaders**: None of the devices with the most up-to-date data | ||||
are reachable. | are reachable. | ||||
- **Ambiguous Leaders**: The internal state of the repository is unclear. | - **Ambiguous Leaders**: The internal state of the repository is unclear. | ||||
Phabricator can detect these issues, and responds by freezing the repository | Phorge can detect these issues, and responds by freezing the repository | ||||
(usually preventing all reads and writes) until the issue is resolved. These | (usually preventing all reads and writes) until the issue is resolved. These | ||||
conditions are normally rare and very little data is at risk, but Phabricator | conditions are normally rare and very little data is at risk, but Phorge | ||||
errs on the side of caution and requires decisions which may result in data | errs on the side of caution and requires decisions which may result in data | ||||
loss to be confirmed by a human. | loss to be confirmed by a human. | ||||
The next sections cover these failure modes and appropriate responses in | The next sections cover these failure modes and appropriate responses in | ||||
more detail. In general, you will respond to these issues by assessing the | more detail. In general, you will respond to these issues by assessing the | ||||
situation and then possibly choosing to discard some data. | situation and then possibly choosing to discard some data. | ||||
Write Interruptions | Write Interruptions | ||||
=================== | =================== | ||||
A repository cluster can be put into an inconsistent state by an interruption | A repository cluster can be put into an inconsistent state by an interruption | ||||
in a brief window during and immediately after a write. This looks like this: | in a brief window during and immediately after a write. This looks like this: | ||||
- A change is pushed to a server. | - A change is pushed to a server. | ||||
- The server acquires a write lock and begins writing the change. | - The server acquires a write lock and begins writing the change. | ||||
- During or immediately after the write, lightning strikes the server | - During or immediately after the write, lightning strikes the server | ||||
and destroys it. | and destroys it. | ||||
Phabricator can not commit changes to a working copy (stored on disk) and to | Phorge can not commit changes to a working copy (stored on disk) and to | ||||
the global state (stored in a database) atomically, so there is necessarily a | the global state (stored in a database) atomically, so there is necessarily a | ||||
narrow window between committing these two different states when some tragedy | narrow window between committing these two different states when some tragedy | ||||
can befall a server, leaving the global and local views of the repository state | can befall a server, leaving the global and local views of the repository state | ||||
possibly divergent. | possibly divergent. | ||||
In these cases, Phabricator fails into a frozen state where further writes | In these cases, Phorge fails into a frozen state where further writes | ||||
are not permitted until the failure is investigated and resolved. When a | are not permitted until the failure is investigated and resolved. When a | ||||
repository is frozen in this way it remains readable. | repository is frozen in this way it remains readable. | ||||
You can use the monitoring console to review the state of a frozen repository | You can use the monitoring console to review the state of a frozen repository | ||||
with a held write lock. The **Writing** column will show which device is | with a held write lock. The **Writing** column will show which device is | ||||
holding the lock, and whoever is named in the **Last Writer** column may be | holding the lock, and whoever is named in the **Last Writer** column may be | ||||
able to help you figure out what happened by providing more information about | able to help you figure out what happened by providing more information about | ||||
what they were doing and what they observed. | what they were doing and what they observed. | ||||
you can recover it manually from the working copy on the device (for example, | you can recover it manually from the working copy on the device (for example, | ||||
by using `git format-patch`) and then push it again after recovering. | by using `git format-patch`) and then push it again after recovering. | ||||
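As a rough sketch of that recovery step, the helper below exports a range of commits as patch files. The function name `export_patches` and every path in the usage note are hypothetical; you would need to identify the last replicated commit and the unacknowledged commit yourself.

```shell
# Hypothetical helper: export the commits in <base>..<tip> from the
# repository at <repo> as patch files written into <outdir>.
# Usage: export_patches <repo> <base> <tip> <outdir>
export_patches() {
  repo=$1
  base=$2
  tip=$3
  outdir=$4
  # `git format-patch` writes one .patch file per commit in the range.
  git -C "$repo" format-patch -o "$outdir" "$base..$tip"
}
```

For example, `export_patches /var/repo/XYZ <last-replicated-commit> <unacknowledged-commit> /tmp/recovered` (placeholder path and commits) would leave patch files in `/tmp/recovered`, which you could later apply in an ordinary clone with `git am` and push again.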
If you demote the device, the in-process write will be thrown away, even if it | If you demote the device, the in-process write will be thrown away, even if it | ||||
was complete on disk. To demote the device and release the write lock, run this | was complete on disk. To demote the device and release the write lock, run this | ||||
command: | command: | ||||
``` | ``` | ||||
phabricator/ $ ./bin/repository thaw <repository> --demote <device> | phorge/ $ ./bin/repository thaw <repository> --demote <device> | ||||
``` | ``` | ||||
{icon exclamation-triangle, color="yellow"} Any committed but unacknowledged | {icon exclamation-triangle, color="yellow"} Any committed but unacknowledged | ||||
data on the device will be lost. | data on the device will be lost. | ||||
Loss of Leaders | Loss of Leaders | ||||
=============== | =============== | ||||
A more straightforward failure condition is the loss of all servers in a | A more straightforward failure condition is the loss of all servers in a | ||||
cluster which have the most up-to-date copy of a repository. This looks like | cluster which have the most up-to-date copy of a repository. This looks like | ||||
this: | this: | ||||
- There is a cluster setup with two devices, X and Y. | - There is a cluster setup with two devices, X and Y. | ||||
- A new change is pushed to server X. | - A new change is pushed to server X. | ||||
- Before the change can propagate to server Y, lightning strikes server X | - Before the change can propagate to server Y, lightning strikes server X | ||||
and destroys it. | and destroys it. | ||||
Here, all of the "leader" devices with the most up-to-date copy of the | Here, all of the "leader" devices with the most up-to-date copy of the | ||||
repository have been lost. Phabricator will freeze the repository and refuse to | repository have been lost. Phorge will freeze the repository and refuse to | ||||
serve requests because it can not serve reads consistently and can not accept | serve requests because it can not serve reads consistently and can not accept | ||||
new writes without data loss. | new writes without data loss. | ||||
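The condition can be illustrated with a short shell sketch. This is a simplified model under stated assumptions, not how Phorge tracks cluster state: the function name `leader_status` and the `<device>:<version>:<up|down>` argument format are invented for the example.

```shell
# Hypothetical model of the "loss of leaders" check; not Phorge's real code.
# Usage: leader_status <device>:<version>:<up|down> ...
# Prints "ok" if at least one device holding the highest version is up,
# otherwise "frozen".
leader_status() {
  max=-1
  # First pass: find the highest version held by any device.
  for entry in "$@"; do
    version=${entry#*:}
    version=${version%%:*}
    if [ "$version" -gt "$max" ]; then
      max=$version
    fi
  done
  # Second pass: look for a reachable device at that version.
  for entry in "$@"; do
    version=${entry#*:}
    version=${version%%:*}
    state=${entry##*:}
    if [ "$version" -eq "$max" ] && [ "$state" = "up" ]; then
      echo ok
      return 0
    fi
  done
  echo frozen
}

leader_status X:8:down Y:7:up   # the only leader (X) is down: prints "frozen"
leader_status X:8:up Y:7:up     # leader X is reachable: prints "ok"
```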
The most straightforward way to resolve this issue is to restore any leader to | The most straightforward way to resolve this issue is to restore any leader to | ||||
service. The change will be able to replicate to other devices once a leader | service. The change will be able to replicate to other devices once a leader | ||||
comes back online. | comes back online. | ||||
If you are unable to restore a leader or unsure that you can restore one | If you are unable to restore a leader or unsure that you can restore one | ||||
quickly, you can use the monitoring console to review which changes are | quickly, you can use the monitoring console to review which changes are | ||||
present on the leaders but not present on the followers by examining the | present on the leaders but not present on the followers by examining the | ||||
push logs. | push logs. | ||||
If you are comfortable discarding these changes, you can instruct Phabricator | If you are comfortable discarding these changes, you can instruct Phorge | ||||
that it can forget about the leaders by doing this: | that it can forget about the leaders by doing this: | ||||
- Disable the service bindings to all of the leader devices so they are no | - Disable the service bindings to all of the leader devices so they are no | ||||
longer part of the cluster. | longer part of the cluster. | ||||
- Then, use `bin/repository thaw` to `--demote` the leaders explicitly. | - Then, use `bin/repository thaw` to `--demote` the leaders explicitly. | ||||
To demote a device, run this command: | To demote a device, run this command: | ||||
``` | ``` | ||||
phabricator/ $ ./bin/repository thaw rXYZ --demote repo002.corp.net | phorge/ $ ./bin/repository thaw rXYZ --demote repo002.corp.net | ||||
``` | ``` | ||||
{icon exclamation-triangle, color="red"} Any data which is only present on | {icon exclamation-triangle, color="red"} Any data which is only present on | ||||
the demoted device will be lost. | the demoted device will be lost. | ||||
If you do this, **you will lose unreplicated data**. You will discard any | If you do this, **you will lose unreplicated data**. You will discard any | ||||
changes on the affected leaders which have not replicated to other devices | changes on the affected leaders which have not replicated to other devices | ||||
in the cluster. | in the cluster. | ||||
If you have lost an entire cluster and replaced it with new devices that you | If you have lost an entire cluster and replaced it with new devices that you | ||||
have restored from backups, you can aggressively wipe all memory of the old | have restored from backups, you can aggressively wipe all memory of the old | ||||
devices by using `--demote <service>` and `--all-repositories`. **This is | devices by using `--demote <service>` and `--all-repositories`. **This is | ||||
dangerous and discards all unreplicated data in any repository on any device.** | dangerous and discards all unreplicated data in any repository on any device.** | ||||
``` | ``` | ||||
phabricator/ $ ./bin/repository thaw --demote repo.corp.net --all-repositories | phorge/ $ ./bin/repository thaw --demote repo.corp.net --all-repositories | ||||
``` | ``` | ||||
After you do this, continue below to promote a leader and restore the cluster | After you do this, continue below to promote a leader and restore the cluster | ||||
to service. | to service. | ||||
Ambiguous Leaders | Ambiguous Leaders | ||||
================= | ================= | ||||
Repository clusters can also freeze if the leader devices are ambiguous. This | Repository clusters can also freeze if the leader devices are ambiguous. This | ||||
can happen if you replace an entire cluster with new devices suddenly, or make | can happen if you replace an entire cluster with new devices suddenly, or make | ||||
a mistake with the `--demote` flag. This may arise from some kind of operator | a mistake with the `--demote` flag. This may arise from some kind of operator | ||||
error, like these: | error, like these: | ||||
- Someone accidentally uses `bin/repository thaw ... --demote` to demote | - Someone accidentally uses `bin/repository thaw ... --demote` to demote | ||||
every device in a cluster. | every device in a cluster. | ||||
- Someone accidentally deletes all the version information for a repository | - Someone accidentally deletes all the version information for a repository | ||||
from the database by making a mistake with a `DELETE` or `UPDATE` query. | from the database by making a mistake with a `DELETE` or `UPDATE` query. | ||||
- Someone accidentally disables all of the devices in a cluster, then adds | - Someone accidentally disables all of the devices in a cluster, then adds | ||||
entirely new ones before repositories can propagate. | entirely new ones before repositories can propagate. | ||||
If you are moving repositories into cluster services, you can also reach this | If you are moving repositories into cluster services, you can also reach this | ||||
state if you use `clusterize` to associate a repository with a service that is | state if you use `clusterize` to associate a repository with a service that is | ||||
bound to multiple active devices. In this case, Phabricator will not know which | bound to multiple active devices. In this case, Phorge will not know which | ||||
device or devices have up-to-date information. | device or devices have up-to-date information. | ||||
When Phabricator can not tell which device in a cluster is a leader, it freezes | When Phorge can not tell which device in a cluster is a leader, it freezes | ||||
the cluster because it is possible that some devices have less data and others | the cluster because it is possible that some devices have less data and others | ||||
have more, and if it chooses a leader arbitrarily it may destroy some data | have more, and if it chooses a leader arbitrarily it may destroy some data | ||||
which you would prefer to retain. | which you would prefer to retain. | ||||
To resolve this, you need to tell Phabricator which device has the most | To resolve this, you need to tell Phorge which device has the most | ||||
up-to-date data and promote that device to become a leader. If you know all | up-to-date data and promote that device to become a leader. If you know all | ||||
devices have the same data, you are free to promote any device. | devices have the same data, you are free to promote any device. | ||||
If you promote a device, **you may lose data** if you promote the wrong device | If you promote a device, **you may lose data** if you promote the wrong device | ||||
and some other device really had more up-to-date data. If you want to double | and some other device really had more up-to-date data. If you want to double | ||||
check, you can examine the working copies on disk before promoting by | check, you can examine the working copies on disk before promoting by | ||||
connecting to the machines and using commands like `git log` to inspect state. | connecting to the machines and using commands like `git log` to inspect state. | ||||
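One way to make that comparison less error-prone is to compare the full set of refs rather than reading `git log` output by eye. The sketch below assumes both copies are reachable as local paths; the helper names `ref_snapshot` and `same_refs` are hypothetical.

```shell
# Hypothetical helper: print a sorted "<ref> <commit>" listing for the
# repository at the given path.
ref_snapshot() {
  git -C "$1" for-each-ref --format='%(refname) %(objectname)' | sort
}

# Hypothetical helper: succeed only if two copies have identical refs.
# Usage: same_refs <repo_a> <repo_b>
same_refs() {
  [ "$(ref_snapshot "$1")" = "$(ref_snapshot "$2")" ]
}
```

When the copies live on different devices, you could instead run the `git for-each-ref` command on each machine over `ssh` and `diff` the saved output. If the listings differ, inspect the differing refs with `git log` before choosing which device to promote.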
Once you have identified a device which has data you're happy with, use | Once you have identified a device which has data you're happy with, use | ||||
`bin/repository thaw` to `--promote` the device. The data on the chosen | `bin/repository thaw` to `--promote` the device. The data on the chosen | ||||
device will become authoritative: | device will become authoritative: | ||||
``` | ``` | ||||
phabricator/ $ ./bin/repository thaw rXYZ --promote repo002.corp.net | phorge/ $ ./bin/repository thaw rXYZ --promote repo002.corp.net | ||||
``` | ``` | ||||
{icon exclamation-triangle, color="red"} Any data which is only present on | {icon exclamation-triangle, color="red"} Any data which is only present on | ||||
**other** devices will be lost. | **other** devices will be lost. | ||||
Backups | Backups | ||||
======= | ======= | ||||
Even if you configure clustering, you should still consider retaining separate | Even if you configure clustering, you should still consider retaining separate | ||||
backup snapshots. Replicas protect you from data loss if you lose a host, but | backup snapshots. Replicas protect you from data loss if you lose a host, but | ||||
they do not let you rewind time to recover from data mutation mistakes. | they do not let you rewind time to recover from data mutation mistakes. | ||||
If something issues a `--force` push that destroys branch heads, the mutation | If something issues a `--force` push that destroys branch heads, the mutation | ||||
will propagate to the replicas. | will propagate to the replicas. | ||||
You may be able to manually restore the branches by using tools like the | You may be able to manually restore the branches by using tools like the | ||||
Phabricator push log or the Git reflog, so it is less important to retain | Phorge push log or the Git reflog, so it is less important to retain | ||||
repository snapshots than database snapshots, but it is still possible for | repository snapshots than database snapshots, but it is still possible for | ||||
data to be lost permanently, especially if you don't notice the problem for | data to be lost permanently, especially if you don't notice the problem for | ||||
some time. | some time. | ||||
Retaining separate backup snapshots will improve your ability to recover more | Retaining separate backup snapshots will improve your ability to recover more | ||||
data more easily in a wider range of disaster situations. | data more easily in a wider range of disaster situations. | ||||
can use a maintenance lock to safely make a working copy mutable. | can use a maintenance lock to safely make a working copy mutable. | ||||
If you simply perform this kind of content-modifying maintenance by directly | If you simply perform this kind of content-modifying maintenance by directly | ||||
modifying the repository on disk with commands like `git update-ref`, your | modifying the repository on disk with commands like `git update-ref`, your | ||||
changes may either encounter conflicts or encounter problems with change | changes may either encounter conflicts or encounter problems with change | ||||
propagation. | propagation. | ||||
You can encounter conflicts because directly modifying the working copy on disk | You can encounter conflicts because directly modifying the working copy on disk | ||||
won't prevent users or Phabricator itself from performing writes to the same | won't prevent users or Phorge itself from performing writes to the same | ||||
working copy at the same time. Phabricator does not compromise the lower-level | working copy at the same time. Phorge does not compromise the lower-level | ||||
locks provided by the VCS so this is theoretically safe -- and this rarely | locks provided by the VCS so this is theoretically safe -- and this rarely | ||||
causes any significant problems in practice -- but doesn't make things any | causes any significant problems in practice -- but doesn't make things any | ||||
simpler or easier. | simpler or easier. | ||||
Your changes may fail to propagate because writing directly to the repository | Your changes may fail to propagate because writing directly to the repository | ||||
doesn't turn it into the new cluster leader after your writes complete. If | doesn't turn it into the new cluster leader after your writes complete. If | ||||
another node accepts the next push, it will become the new leader -- without | another node accepts the next push, it will become the new leader -- without | ||||
your changes -- and all other nodes will synchronize from it. | your changes -- and all other nodes will synchronize from it. | ||||