Differential D25007 Diff 23 src/docs/user/cluster/cluster_repositories.diviner

@title Cluster: Repositories		@title Cluster: Repositories
@group cluster		@group cluster

Configuring Phabricator to use multiple repository hosts.		Configuring Phorge to use multiple repository hosts.

Overview		Overview
========		========

If you use Git, you can deploy Phabricator with multiple repository hosts,		If you use Git, you can deploy Phorge with multiple repository hosts,
configured so that each host is readable and writable. The advantages of doing		configured so that each host is readable and writable. The advantages of doing
this are:		this are:

- you can completely survive the loss of repository hosts;		- you can completely survive the loss of repository hosts;
- reads and writes can scale across multiple machines; and		- reads and writes can scale across multiple machines; and
- read and write performance across multiple geographic regions may improve.		- read and write performance across multiple geographic regions may improve.

This configuration is complex, and many installs do not need to pursue it.		This configuration is complex, and many installs do not need to pursue it.

This configuration is not currently supported with Subversion or Mercurial.		This configuration is not currently supported with Subversion or Mercurial.


How Reads and Writes Work		How Reads and Writes Work
=========================		=========================

Phabricator repository replicas are multi-master: every node is readable and		Phorge repository replicas are multi-master: every node is readable and
writable, and a cluster of nodes can (almost always) survive the loss of any		writable, and a cluster of nodes can (almost always) survive the loss of any
arbitrary subset of nodes so long as at least one node is still alive.		arbitrary subset of nodes so long as at least one node is still alive.

Phabricator maintains an internal version for each repository, and increments		Phorge maintains an internal version for each repository, and increments
it when the repository is mutated.		it when the repository is mutated.

Before responding to a read, replicas make sure their version of the repository		Before responding to a read, replicas make sure their version of the repository
is up to date (no node in the cluster has a newer version of the repository).		is up to date (no node in the cluster has a newer version of the repository).
If it isn't, they block the read until they can complete a fetch.		If it isn't, they block the read until they can complete a fetch.

Before responding to a write, replicas obtain a global lock, perform the same		Before responding to a write, replicas obtain a global lock, perform the same
version check and fetch if necessary, then allow the write to continue.		version check and fetch if necessary, then allow the write to continue.
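
The read and write paths described above can be sketched as a toy model. This is a minimal illustration, not Phorge's implementation: the `Cluster` class, the `sync`, `read` and `write` functions, and the in-memory version table are all hypothetical, and a fetch is reduced to copying a version number.

```python
# Toy model of the version check described above; not Phorge's real code.
from dataclasses import dataclass, field
from threading import Lock


@dataclass
class Cluster:
    """Hypothetical stand-in for the shared cluster state."""
    versions: dict = field(default_factory=dict)  # device name -> version on disk
    write_lock: object = field(default_factory=Lock)  # models the global write lock


def sync(cluster, device):
    """Bring `device` up to date; return True if a fetch was needed."""
    current = cluster.versions.setdefault(device, 0)
    newest = max(cluster.versions.values())
    if current < newest:
        # A real replica would block here until its fetch completes.
        cluster.versions[device] = newest
        return True
    return False


def read(cluster, device):
    """Serve a read only after the device has caught up."""
    sync(cluster, device)
    return cluster.versions[device]


def write(cluster, device):
    """Take the global lock, catch up, then record the mutation."""
    with cluster.write_lock:
        sync(cluster, device)
        cluster.versions[device] += 1
        return cluster.versions[device]
```

In this model, a read on a device that is behind first catches up to the newest version, and a write on any device leaves that device with the highest version in the cluster.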

Other mitigations are possible, but securing a network against the NSA and		Other mitigations are possible, but securing a network against the NSA and
similar agents of other rogue nations is beyond the scope of this document.		similar agents of other rogue nations is beyond the scope of this document.


Repository Hosts		Repository Hosts
================		================

Repository hosts must run a complete, fully configured copy of Phabricator,		Repository hosts must run a complete, fully configured copy of Phorge,
including a webserver. They must also run a properly configured `sshd`.		including a webserver. They must also run a properly configured `sshd`.

If you are converting existing hosts into cluster hosts, you may need to		If you are converting existing hosts into cluster hosts, you may need to
revisit @{article:Diffusion User Guide: Repository Hosting} and make sure		revisit @{article:Diffusion User Guide: Repository Hosting} and make sure
the system user accounts have all the necessary `sudo` permissions. In		the system user accounts have all the necessary `sudo` permissions. In
particular, cluster devices need `sudo` access to `ssh` so they can read		particular, cluster devices need `sudo` access to `ssh` so they can read
device keys.		device keys.

Edit Policies >		Edit Policies >
Can Manage Cluster Services }		Can Manage Cluster Services }

Once the hosts are registered as devices, you can create a new service in		Once the hosts are registered as devices, you can create a new service in
Almanac:		Almanac:

- First, register at least one device according to the device clustering		- First, register at least one device according to the device clustering
instructions.		instructions.
- Create a new service of type Phabricator Cluster: Repository in		- Create a new service of type Phorge Cluster: Repository in
Almanac.		Almanac.
- Bind this service to all the interfaces on the device or devices.		- Bind this service to all the interfaces on the device or devices.
- For each binding, add a `protocol` key with one of these values:		- For each binding, add a `protocol` key with one of these values:
`ssh`, `http`, `https`.		`ssh`, `http`, `https`.

For example, a service might look like this:		For example, a service might look like this:

- Service: `repos001.mycompany.net`		- Service: `repos001.mycompany.net`
Show All 30 Lines
```		```

To migrate a repository back off a service, use this command:		To migrate a repository back off a service, use this command:

```		```
$ ./bin/repository clusterize <repository> --remove-service		$ ./bin/repository clusterize <repository> --remove-service
```		```

This command only changes how Phabricator connects to the repository; it does		This command only changes how Phorge connects to the repository; it does
not move any data or make any complex structural changes.		not move any data or make any complex structural changes.

When Phabricator needs information about a non-clustered repository, it just		When Phorge needs information about a non-clustered repository, it just
runs a command like `git log` directly on disk. When Phabricator needs		runs a command like `git log` directly on disk. When Phorge needs
information about a clustered repository, it instead makes a service call to		information about a clustered repository, it instead makes a service call to
another server, asking that server to run `git log` instead.		another server, asking that server to run `git log` instead.

In a single-host cluster the server will make this service call to itself, so		In a single-host cluster the server will make this service call to itself, so
nothing will really change. But this //is// an effective test for most		nothing will really change. But this //is// an effective test for most
possible configuration mistakes.		possible configuration mistakes.

If your canary repository works well, you can migrate the rest of your		If your canary repository works well, you can migrate the rest of your

To expand an existing cluster, follow these general steps:		To expand an existing cluster, follow these general steps:

- Register new devices in Almanac.		- Register new devices in Almanac.
- Add bindings to the new devices to the repository service, also in Almanac.		- Add bindings to the new devices to the repository service, also in Almanac.
- Start the daemons on the new devices.		- Start the daemons on the new devices.

For instructions on configuring and registering devices, see		For instructions on configuring and registering devices, see
@{article:Cluster: Devices}.		@{article:Cluster: Devices}.

As soon as you add active bindings to a service, Phabricator will begin		As soon as you add active bindings to a service, Phorge will begin
synchronizing repositories and sending traffic to the new device. You do not		synchronizing repositories and sending traffic to the new device. You do not
need to copy any repository data to the device: Phabricator will automatically		need to copy any repository data to the device: Phorge will automatically
synchronize it.		synchronize it.

If you have a large amount of repository data, you may want to help this		If you have a large amount of repository data, you may want to help this
process along by copying the repository directory from an existing cluster		process along by copying the repository directory from an existing cluster
device before bringing the new host online. This is optional, but can reduce		device before bringing the new host online. This is optional, but can reduce
the amount of time required to fully synchronize the cluster.		the amount of time required to fully synchronize the cluster.

You do not need to synchronize the most up-to-date data or stop writes during		You do not need to synchronize the most up-to-date data or stop writes during

You can get a more detailed view of the current status of a specific repository on		You can get a more detailed view of the current status of a specific repository on
cluster devices in {nav Diffusion > (Repository) > Manage Repository > Cluster		cluster devices in {nav Diffusion > (Repository) > Manage Repository > Cluster
Configuration}.		Configuration}.

This screen shows all the configured devices which are hosting the repository		This screen shows all the configured devices which are hosting the repository
and the version available on each device.		and the version available on each device.

Version: When a repository is mutated by a push, Phabricator increases		Version: When a repository is mutated by a push, Phorge increases
an internal version number for the repository. This column shows which version		an internal version number for the repository. This column shows which version
is on disk on the corresponding device.		is on disk on the corresponding device.

After a change is pushed, the device which received the change will have a		After a change is pushed, the device which received the change will have a
larger version number than the other devices. The change should be passively		larger version number than the other devices. The change should be passively
replicated to the remaining devices after a brief period of time, although this		replicated to the remaining devices after a brief period of time, although this
can take a while if the change was large or the network connection between		can take a while if the change was large or the network connection between
devices is slow or unreliable.		devices is slow or unreliable.

There are three major cluster failure modes:		There are three major cluster failure modes:

- Write Interruptions: A write started but did not complete, leaving		- Write Interruptions: A write started but did not complete, leaving
the disk state and cluster state out of sync.		the disk state and cluster state out of sync.
- Loss of Leaders: None of the devices with the most up-to-date data		- Loss of Leaders: None of the devices with the most up-to-date data
are reachable.		are reachable.
- Ambiguous Leaders: The internal state of the repository is unclear.		- Ambiguous Leaders: The internal state of the repository is unclear.

Phabricator can detect these issues, and responds by freezing the repository		Phorge can detect these issues, and responds by freezing the repository
(usually preventing all reads and writes) until the issue is resolved. These		(usually preventing all reads and writes) until the issue is resolved. These
conditions are normally rare and very little data is at risk, but Phabricator		conditions are normally rare and very little data is at risk, but Phorge
errs on the side of caution and requires decisions which may result in data		errs on the side of caution and requires decisions which may result in data
loss to be confirmed by a human.		loss to be confirmed by a human.

The next sections cover these failure modes and appropriate responses in		The next sections cover these failure modes and appropriate responses in
more detail. In general, you will respond to these issues by assessing the		more detail. In general, you will respond to these issues by assessing the
situation and then possibly choosing to discard some data.		situation and then possibly choosing to discard some data.


Write Interruptions		Write Interruptions
===================		===================

A repository cluster can be put into an inconsistent state by an interruption		A repository cluster can be put into an inconsistent state by an interruption
in a brief window during and immediately after a write. This looks like this:		in a brief window during and immediately after a write. This looks like this:

- A change is pushed to a server.		- A change is pushed to a server.
- The server acquires a write lock and begins writing the change.		- The server acquires a write lock and begins writing the change.
- During or immediately after the write, lightning strikes the server		- During or immediately after the write, lightning strikes the server
and destroys it.		and destroys it.

Phabricator can not commit changes to a working copy (stored on disk) and to		Phorge can not commit changes to a working copy (stored on disk) and to
the global state (stored in a database) atomically, so there is necessarily a		the global state (stored in a database) atomically, so there is necessarily a
narrow window between committing these two different states when some tragedy		narrow window between committing these two different states when some tragedy
can befall a server, leaving the global and local views of the repository state		can befall a server, leaving the global and local views of the repository state
possibly divergent.		possibly divergent.
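
The window between the two commits can be illustrated with a small sketch. It is a simplified model under stated assumptions, not Phorge's code: the `push` function, the `LightningStrike` exception, and the `disk` and `database` dictionaries are hypothetical stand-ins for the working copy and the global state.

```python
# Toy model of a write interrupted between its two commits.
class LightningStrike(Exception):
    """Hypothetical failure injected between the two commits."""


def push(disk, database, device, strike=False):
    database["writing"] = device        # take the write lock
    disk[device] += 1                   # commit 1: working copy on disk
    if strike:
        raise LightningStrike(device)   # the lock is never released
    database["version"] = disk[device]  # commit 2: global state
    database["writing"] = None          # release the lock


disk = {"X": 4}
database = {"version": 4, "writing": None}
try:
    push(disk, database, "X", strike=True)
except LightningStrike:
    pass

# The disk moved ahead of the database, and the lock is still held.
diverged = disk["X"] != database["version"]
```

After the simulated failure, the on-disk version and the recorded global version disagree while the write lock is still held. That is the condition in which the repository is frozen.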

In these cases, Phabricator fails into a frozen state where further writes		In these cases, Phorge fails into a frozen state where further writes
are not permitted until the failure is investigated and resolved. When a		are not permitted until the failure is investigated and resolved. When a
repository is frozen in this way it remains readable.		repository is frozen in this way it remains readable.

You can use the monitoring console to review the state of a frozen repository		You can use the monitoring console to review the state of a frozen repository
with a held write lock. The Writing column will show which device is		with a held write lock. The Writing column will show which device is
holding the lock, and whoever is named in the Last Writer column may be		holding the lock, and whoever is named in the Last Writer column may be
able to help you figure out what happened by providing more information about		able to help you figure out what happened by providing more information about
what they were doing and what they observed.		what they were doing and what they observed.

you can recover it manually from the working copy on the device (for example,		you can recover it manually from the working copy on the device (for example,
by using `git format-patch`) and then push it again after recovering.		by using `git format-patch`) and then push it again after recovering.

If you demote the device, the in-process write will be thrown away, even if it		If you demote the device, the in-process write will be thrown away, even if it
was complete on disk. To demote the device and release the write lock, run this		was complete on disk. To demote the device and release the write lock, run this
command:		command:

```		```
phabricator/ $ ./bin/repository thaw <repository> --demote <device>		phorge/ $ ./bin/repository thaw <repository> --demote <device>
```		```

{icon exclamation-triangle, color="yellow"} Any committed but unacknowledged		{icon exclamation-triangle, color="yellow"} Any committed but unacknowledged
data on the device will be lost.		data on the device will be lost.


Loss of Leaders		Loss of Leaders
===============		===============

A more straightforward failure condition is the loss of all servers in a		A more straightforward failure condition is the loss of all servers in a
cluster which have the most up-to-date copy of a repository. This looks like		cluster which have the most up-to-date copy of a repository. This looks like
this:		this:

- There is a cluster setup with two devices, X and Y.		- There is a cluster setup with two devices, X and Y.
- A new change is pushed to server X.		- A new change is pushed to server X.
- Before the change can propagate to server Y, lightning strikes server X		- Before the change can propagate to server Y, lightning strikes server X
and destroys it.		and destroys it.

Here, all of the "leader" devices with the most up-to-date copy of the		Here, all of the "leader" devices with the most up-to-date copy of the
repository have been lost. Phabricator will freeze the repository and refuse to		repository have been lost. Phorge will freeze the repository and refuse to
serve requests because it can not serve reads consistently and can not accept		serve requests because it can not serve reads consistently and can not accept
new writes without data loss.		new writes without data loss.
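
As a rough sketch of the condition above, assume each device reports the version it has on disk. The `leaders` and `frozen_by_leader_loss` helpers and the device names are hypothetical. This only models the idea that a repository cannot be served when no device holding the newest version is reachable.

```python
# Toy model: leaders are the devices holding the newest known version.
def leaders(versions):
    """Return the set of devices with the highest recorded version."""
    newest = max(versions.values())
    return {device for device, version in versions.items() if version == newest}


def frozen_by_leader_loss(versions, reachable):
    """True when none of the leader devices is currently reachable."""
    return not (leaders(versions) & set(reachable))


versions = {"X": 5, "Y": 4}  # X accepted a push that Y has not fetched yet
```

With these values, losing `X` while only `Y` remains reachable reports a leader loss. Restoring `X` clears it, which matches the recovery path of bringing any leader back into service.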

The most straightforward way to resolve this issue is to restore any leader to		The most straightforward way to resolve this issue is to restore any leader to
service. The change will be able to replicate to other devices once a leader		service. The change will be able to replicate to other devices once a leader
comes back online.		comes back online.

If you are unable to restore a leader or unsure that you can restore one		If you are unable to restore a leader or unsure that you can restore one
quickly, you can use the monitoring console to review which changes are		quickly, you can use the monitoring console to review which changes are
present on the leaders but not present on the followers by examining the		present on the leaders but not present on the followers by examining the
push logs.		push logs.

If you are comfortable discarding these changes, you can instruct Phabricator		If you are comfortable discarding these changes, you can instruct Phorge
that it can forget about the leaders by doing this:		that it can forget about the leaders by doing this:

- Disable the service bindings to all of the leader devices so they are no		- Disable the service bindings to all of the leader devices so they are no
longer part of the cluster.		longer part of the cluster.
- Then, use `bin/repository thaw` to `--demote` the leaders explicitly.		- Then, use `bin/repository thaw` to `--demote` the leaders explicitly.

To demote a device, run this command:		To demote a device, run this command:

```		```
phabricator/ $ ./bin/repository thaw rXYZ --demote repo002.corp.net		phorge/ $ ./bin/repository thaw rXYZ --demote repo002.corp.net
```		```

{icon exclamation-triangle, color="red"} Any data which is only present on		{icon exclamation-triangle, color="red"} Any data which is only present on
the demoted device will be lost.		the demoted device will be lost.

If you do this, you will lose unreplicated data. You will discard any		If you do this, you will lose unreplicated data. You will discard any
changes on the affected leaders which have not replicated to other devices		changes on the affected leaders which have not replicated to other devices
in the cluster.		in the cluster.

If you have lost an entire cluster and replaced it with new devices that you		If you have lost an entire cluster and replaced it with new devices that you
have restored from backups, you can aggressively wipe all memory of the old		have restored from backups, you can aggressively wipe all memory of the old
devices by using `--demote <service>` and `--all-repositories`. **This is		devices by using `--demote <service>` and `--all-repositories`. **This is
dangerous and discards all unreplicated data in any repository on any device.**		dangerous and discards all unreplicated data in any repository on any device.**

```		```
phabricator/ $ ./bin/repository thaw --demote repo.corp.net --all-repositories		phorge/ $ ./bin/repository thaw --demote repo.corp.net --all-repositories
```		```

After you do this, continue below to promote a leader and restore the cluster		After you do this, continue below to promote a leader and restore the cluster
to service.		to service.


Ambiguous Leaders		Ambiguous Leaders
=================		=================

Repository clusters can also freeze if the leader devices are ambiguous. This		Repository clusters can also freeze if the leader devices are ambiguous. This
can happen if you replace an entire cluster with new devices suddenly, or make		can happen if you replace an entire cluster with new devices suddenly, or make
a mistake with the `--demote` flag. This may arise from some kind of operator		a mistake with the `--demote` flag. This may arise from some kind of operator
error, like these:		error, like these:

- Someone accidentally uses `bin/repository thaw ... --demote` to demote		- Someone accidentally uses `bin/repository thaw ... --demote` to demote
every device in a cluster.		every device in a cluster.
- Someone accidentally deletes all the version information for a repository		- Someone accidentally deletes all the version information for a repository
from the database by making a mistake with a `DELETE` or `UPDATE` query.		from the database by making a mistake with a `DELETE` or `UPDATE` query.
- Someone accidentally disables all of the devices in a cluster, then adds		- Someone accidentally disables all of the devices in a cluster, then adds
entirely new ones before repositories can propagate.		entirely new ones before repositories can propagate.

If you are moving repositories into cluster services, you can also reach this		If you are moving repositories into cluster services, you can also reach this
state if you use `clusterize` to associate a repository with a service that is		state if you use `clusterize` to associate a repository with a service that is
bound to multiple active devices. In this case, Phabricator will not know which		bound to multiple active devices. In this case, Phorge will not know which
device or devices have up-to-date information.		device or devices have up-to-date information.

When Phabricator can not tell which device in a cluster is a leader, it freezes		When Phorge can not tell which device in a cluster is a leader, it freezes
the cluster because it is possible that some devices have less data and others		the cluster because it is possible that some devices have less data and others
have more, and if it chooses a leader arbitrarily it may destroy some data		have more, and if it chooses a leader arbitrarily it may destroy some data
which you would prefer to retain.		which you would prefer to retain.

To resolve this, you need to tell Phabricator which device has the most		To resolve this, you need to tell Phorge which device has the most
up-to-date data and promote that device to become a leader. If you know all		up-to-date data and promote that device to become a leader. If you know all
devices have the same data, you are free to promote any device.		devices have the same data, you are free to promote any device.

If you promote a device, you may lose data if you promote the wrong device		If you promote a device, you may lose data if you promote the wrong device
and some other device really had more up-to-date data. If you want to double		and some other device really had more up-to-date data. If you want to double
check, you can examine the working copies on disk before promoting by		check, you can examine the working copies on disk before promoting by
connecting to the machines and using commands like `git log` to inspect state.		connecting to the machines and using commands like `git log` to inspect state.

Once you have identified a device which has data you're happy with, use		Once you have identified a device which has data you're happy with, use
`bin/repository thaw` to `--promote` the device. The data on the chosen		`bin/repository thaw` to `--promote` the device. The data on the chosen
device will become authoritative:		device will become authoritative:

```		```
phabricator/ $ ./bin/repository thaw rXYZ --promote repo002.corp.net		phorge/ $ ./bin/repository thaw rXYZ --promote repo002.corp.net
```		```

{icon exclamation-triangle, color="red"} Any data which is only present on		{icon exclamation-triangle, color="red"} Any data which is only present on
other devices will be lost.		other devices will be lost.


Backups		Backups
=======		=======

Even if you configure clustering, you should still consider retaining separate		Even if you configure clustering, you should still consider retaining separate
backup snapshots. Replicas protect you from data loss if you lose a host, but		backup snapshots. Replicas protect you from data loss if you lose a host, but
they do not let you rewind time to recover from data mutation mistakes.		they do not let you rewind time to recover from data mutation mistakes.

If something issues a `--force` push that destroys branch heads, the mutation		If something issues a `--force` push that destroys branch heads, the mutation
will propagate to the replicas.		will propagate to the replicas.

You may be able to manually restore the branches by using tools like the		You may be able to manually restore the branches by using tools like the
Phabricator push log or the Git reflog so it is less important to retain		Phorge push log or the Git reflog so it is less important to retain
repository snapshots than database snapshots, but it is still possible for		repository snapshots than database snapshots, but it is still possible for
data to be lost permanently, especially if you don't notice the problem for		data to be lost permanently, especially if you don't notice the problem for
some time.		some time.

Retaining separate backup snapshots will improve your ability to recover more		Retaining separate backup snapshots will improve your ability to recover more
data more easily in a wider range of disaster situations.		data more easily in a wider range of disaster situations.


can use a maintenance lock to safely make a working copy mutable.		can use a maintenance lock to safely make a working copy mutable.

If you simply perform this kind of content-modifying maintenance by directly		If you simply perform this kind of content-modifying maintenance by directly
modifying the repository on disk with commands like `git update-ref`, your		modifying the repository on disk with commands like `git update-ref`, your
changes may either encounter conflicts or encounter problems with change		changes may either encounter conflicts or encounter problems with change
propagation.		propagation.

You can encounter conflicts because directly modifying the working copy on disk		You can encounter conflicts because directly modifying the working copy on disk
won't prevent users or Phabricator itself from performing writes to the same		won't prevent users or Phorge itself from performing writes to the same
working copy at the same time. Phabricator does not compromise the lower-level		working copy at the same time. Phorge does not compromise the lower-level
locks provided by the VCS so this is theoretically safe -- and this rarely		locks provided by the VCS so this is theoretically safe -- and this rarely
causes any significant problems in practice -- but doesn't make things any		causes any significant problems in practice -- but doesn't make things any
simpler or easier.		simpler or easier.

Your changes may fail to propagate because writing directly to the repository		Your changes may fail to propagate because writing directly to the repository
doesn't turn it into the new cluster leader after your writes complete. If		doesn't turn it into the new cluster leader after your writes complete. If
another node accepts the next push, it will become the new leader -- without		another node accepts the next push, it will become the new leader -- without
your changes -- and all other nodes will synchronize from it.		your changes -- and all other nodes will synchronize from it.
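
The propagation problem can be sketched with a toy simulation. It assumes a deliberately simplified model in which synchronizing replaces a device's content with the leader's. The `simulate` function, the device names, and the commit labels are hypothetical.

```python
# Toy model: a direct on-disk edit does not bump the cluster version.
def simulate():
    disk = {"X": ["a"], "Y": ["a"]}   # commits present on each device
    versions = {"X": 1, "Y": 1}       # versions recorded for the cluster

    # Maintenance performed directly on X: content changes, version does not.
    disk["X"].append("manual-fix")

    # The next push lands on Y, which becomes the leader.
    disk["Y"].append("b")
    versions["Y"] += 1

    # Every device behind the leader synchronizes from it.
    leader = max(versions, key=versions.get)
    for device in disk:
        if versions[device] < versions[leader]:
            disk[device] = list(disk[leader])
            versions[device] = versions[leader]
    return disk, versions
```

In this model the manual change on `X` is gone after synchronization, because `X` never held a newer version than the rest of the cluster.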