@recaptime-dev's working patches + fork for Phorge, a community fork of Phabricator. (Upstream dev and stable branches are at upstream/main and upstream/stable respectively.) hq.recaptime.dev/wiki/Phorge
@title Clustering Introduction
@group cluster

Guide to configuring Phorge across multiple hosts for availability and
performance.


Overview
========

WARNING: This feature is a prototype. Installs should expect a challenging
adventure when deploying clusters. In the best of times, configuring a
cluster is complex and requires significant operations experience.

Phorge can be configured to run on multiple hosts with redundant services
to improve its availability and scalability, and make disaster recovery much
easier.

Clustering is more complex to set up and maintain than running everything on a
single host, but greatly reduces the cost of recovering from hardware and
network failures.

Each Phorge service has an array of clustering options that can be
configured somewhat independently. Configuring a cluster is inherently complex,
and this is an advanced feature aimed at installs with large userbases and
experienced operations personnel who need this high degree of flexibility.

The remainder of this document summarizes how to add redundancy to each
service and where your efforts are likely to have the greatest impact.

For additional guidance on setting up a cluster, see "Overlaying Services"
and "Cluster Recipes" at the bottom of this document.


Clusterable Services
====================

This table provides an overview of clusterable services, their setup
complexity, and the rough impact that converting them to run on multiple hosts
will have on availability, resistance to data loss, and scalability.

| Service | Setup | Availability | Loss Resistance | Scalability
|---------|-------|--------------|-----------------|------------
| **Databases** | Moderate | **High** | **High** | Low
| **Repositories** | Complex | Moderate | **High** | Moderate
| **Daemons** | Minimal | Low | No Risk | Low
| **SSH Servers** | Minimal | Low | No Risk | Low
| **Web Servers** | Minimal | **High** | No Risk | Moderate
| **Notifications** | Minimal | Low | No Risk | Low
| **Fulltext Search** | Minimal | Low | No Risk | Low

See below for a walkthrough of these services in greater detail.


Preparing for Clustering
========================

To begin deploying Phorge in cluster mode, set up `cluster.addresses`
in your configuration.

This option should contain a list of network address blocks which are considered
to be part of the cluster. Hosts in this list are allowed to bend (or even
break) some of the security and policy rules when they make requests to other
hosts in the cluster, so this list should be as small as possible. See "Cluster
Allowlist Security" below for discussion.

If you are deploying hardware in EC2, a reasonable approach is to launch a
dedicated Phorge VPC, put the whole VPC into the allowlist as a Phorge cluster,
and then deploy only Phorge services into that VPC.

If you have additional auxiliary hosts which run builds and tests via Drydock,
you should //not// include them in the cluster address definition. For more
detailed discussion of the Drydock security model, see
@{article:Drydock User Guide: Security}.

Most other clustering features will not work until you define a cluster by
configuring `cluster.addresses`.


Cluster Allowlist Security
==========================

When you configure `cluster.addresses`, you should keep the list of trusted
cluster hosts as small as possible.
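For example, a minimal allowlist covering a single private network block can be set with the `bin/config` tool. This is a sketch: the CIDR range shown is a placeholder, and you should substitute the network your cluster hosts actually occupy.

```shell
# Define the cluster allowlist. The address block below is a placeholder;
# replace it with the network range your cluster hosts occupy. Complex
# configuration values are passed to "bin/config set" as JSON.
./bin/config set cluster.addresses '["10.0.0.0/24"]'
```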
Hosts on this list gain additional
capabilities, including these:

**Trusted HTTP Headers**: Normally, Phorge distrusts the load balancer
HTTP headers `X-Forwarded-For` and `X-Forwarded-Proto` because they may be
client-controlled and can be set to arbitrary values by an attacker if no load
balancer is deployed. In particular, clients can set `X-Forwarded-For` to any
value and spoof traffic from arbitrary remotes.

These headers are trusted when they are received from a host on the cluster
address allowlist. This allows requests from cluster load balancers to be
interpreted correctly by default without requiring additional custom code or
configuration.

**Intracluster HTTP**: Requests from cluster hosts are not required to use
HTTPS, even if `security.require-https` is enabled, because it is common to
terminate HTTPS on load balancers and use plain HTTP for requests within a
cluster.

**Special Authentication Mechanisms**: Cluster hosts are allowed to connect to
other cluster hosts with "root credentials", and to impersonate any user
account.

The use of root credentials is required because the daemons must be able to
bypass policies in order to function properly: they need to send mail about
private conversations and import commits in private repositories.

The ability to impersonate users is required because SSH nodes must receive,
interpret, modify, and forward SSH traffic. They cannot use the original
credentials to do this because SSH authentication is asymmetric and they do not
have the user's private key. Instead, they use root credentials and impersonate
the user within the cluster.

These mechanisms are still authenticated (and use asymmetric keys, like SSH
does), so access to a host in the cluster address block does not mean that an
attacker can immediately compromise the cluster.
However, an over-broad cluster
address allowlist may give an attacker who gains some access additional tools
to escalate access.

Note that if an attacker gains access to an actual cluster host, these extra
powers are largely moot. Most cluster hosts must be able to connect to the
master database to function properly, so the attacker will just do that and
freely read or modify whatever data they want.


Cluster: Databases
==================

Configuring multiple database hosts is moderately complex, but normally has the
highest impact on availability and resistance to data loss. This is usually the
most important service to make redundant if your focus is on availability and
disaster recovery.

Configuring replicas allows Phorge to run in read-only mode if you lose
the master and to quickly promote the replica as a replacement.

For details, see @{article:Cluster: Databases}.


Cluster: Repositories
=====================

Configuring multiple repository hosts is complex, but is required before you
can add multiple daemon or web hosts.

Repository replicas are important for availability if you host repositories
on Phorge, but less important if you host repositories elsewhere
(instead, you should focus on making that service more available).

The distributed nature of Git and Mercurial tends to mean that they are
naturally somewhat resistant to data loss: every clone of a repository includes
the entire history.

Repositories may become a scalability bottleneck, although this is rare unless
your install has an unusually heavy repository read volume. Slow clones/fetches
may hint at a repository capacity problem. Adding more repository hosts will
provide an approximately linear increase in capacity.

For details, see @{article:Cluster: Repositories}.
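One rough way to watch for repository capacity problems is to time a fresh clone periodically and compare the result against a known-good baseline. This is a sketch: the repository URI and probe path below are placeholders, assuming a Git repository hosted on the install over SSH.

```shell
# Time a fresh clone to get a baseline for repository read performance.
# The repository URI is a placeholder for one of your hosted repositories.
time git clone --no-checkout ssh://git@phorge.example.com/source/project.git /tmp/clone-probe

# Remove the probe clone afterwards.
rm -rf /tmp/clone-probe
```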


Cluster: Daemons
================

Configuring multiple daemon hosts is straightforward, but you must configure
repositories first.

With daemons running on multiple hosts you can transparently survive the loss
of any subset of hosts without an interruption to daemon services, as long as
at least one host remains alive. Daemons are stateless, so spreading daemons
across multiple hosts provides no resistance to data loss.

Daemons can become a bottleneck, particularly if your install sees a large
volume of write traffic to repositories. If the daemon task queue has a
backlog, that hints at a capacity problem. If existing hosts have unused
resources, increase `phd.taskmasters` until they are fully utilized. From
there, adding more daemon hosts will provide an approximately linear increase
in capacity.

For details, see @{article:Cluster: Daemons}.


Cluster: SSH Servers
====================

Configuring multiple SSH hosts is straightforward, but you must configure
repositories first.

With multiple SSH hosts you can transparently survive the loss of any subset
of hosts without interruption to repository services, as long as at least one
host remains alive. SSH services are stateless, so putting multiple hosts in
service provides no resistance to data loss because no data is at risk.

SSH hosts are very rarely a scalability bottleneck.

For details, see @{article:Cluster: SSH Servers}.


Cluster: Web Servers
====================

Configuring multiple web hosts is straightforward, but you must configure
repositories first.

With multiple web hosts you can transparently survive the loss of any subset
of hosts as long as at least one host remains alive. Web services are stateless,
so putting multiple hosts in service provides no resistance to data loss
because no data is at risk.
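When running several web hosts behind a load balancer, it is useful to verify each host directly rather than only through the balancer. As a sketch, assuming the install answers basic health checks at `/status/` (hostnames below are placeholders for your own web tier):

```shell
# Probe each web host's health check endpoint directly, bypassing the
# load balancer. Hostnames are placeholders for your own web hosts.
for host in web1.example.com web2.example.com; do
  printf '%s: ' "$host"
  curl -s -o /dev/null -w '%{http_code}\n' "http://$host/status/"
done
```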

Web hosts can become a bottleneck, particularly if you have a workload that is
heavily focused on reads from the web UI (like a public install with many
anonymous users). Slow responses to web requests may hint at a web capacity
problem. Adding more hosts will provide an approximately linear increase in
capacity.

For details, see @{article:Cluster: Web Servers}.


Cluster: Notifications
======================

Configuring multiple notification hosts is simple and has no prerequisites.

With multiple notification hosts, you can survive the loss of any subset of
hosts as long as at least one host remains alive. Service may be briefly
disrupted directly after the incident which destroys the other hosts.

Notifications are noncritical, so this normally has little practical impact
on service availability. Notifications are also stateless, so clustering this
service provides no resistance to data loss because no data is at risk.

Notification delivery normally requires very few resources, so adding more
hosts is unlikely to have much impact on scalability.

For details, see @{article:Cluster: Notifications}.


Cluster: Fulltext Search
========================

Configuring search services is relatively simple and has no prerequisites.

By default, Phorge uses MySQL as a fulltext search engine, so deploying
multiple database hosts will effectively also deploy multiple fulltext search
hosts.

Search indexes can be completely rebuilt from the database, so there is no
risk of data loss no matter how fulltext search is configured.

For details, see @{article:Cluster: Search}.


Overlaying Services
===================

Although hosts can run a single dedicated service type, certain groups of
services work well together.
Phorge clusters usually do not need to be
very large, so deploying a small number of hosts with multiple services is a
good place to start.

In planning a cluster, consider these blended host types:

**Everything**: Run HTTP, SSH, MySQL, notifications, repositories and daemons
on a single host. This is the starting point for single-node setups, and
usually also the best configuration when adding the second node.

**Everything Except Databases**: Run HTTP, SSH, notifications, repositories and
daemons on one host, and MySQL on a different host. MySQL uses many of the same
resources that other services use. It's also simpler to separate than other
services, and tends to benefit the most from dedicated hardware.

**Repositories and Daemons**: Run repositories and daemons on the same host.
Repository hosts //must// run daemons, and it normally makes sense to
completely overlay repositories and daemons. These services tend to use
different resources (repositories are heavier on I/O and lighter on CPU/RAM;
daemons are heavier on CPU/RAM and lighter on I/O).

Repositories and daemons are also both less latency-sensitive than other
service types, so there's a wider margin of error for under-provisioning them
before performance is noticeably affected.

These nodes tend to use system resources in a balanced way. Individual nodes
in this class do not need to be particularly powerful.

**Frontend Servers**: Run HTTP and SSH on the same host. These are easy to set
up, stateless, and you can scale the pool up or down easily to meet demand.
Routing both types of ingress traffic through the same initial tier can
simplify load balancing.

These nodes tend to need relatively little RAM.


Cluster Recipes
===============

This section provides some guidance on reasonable ways to scale up a cluster.

The smallest possible cluster is **two hosts**.
Run everything (web, ssh,
database, notifications, repositories, and daemons) on each host. One host will
serve as the master; the other will serve as a replica.

Ideally, you should physically separate these hosts to reduce the chance that a
natural disaster or infrastructure disruption could disable or destroy both
hosts at the same time.

From here, you can choose how you expand the cluster.

To improve **scalability and performance**, separate loaded services onto
dedicated hosts and then add more hosts of that type to increase capacity. If
you have a two-node cluster, the best way to improve scalability by adding one
host is likely to separate the master database onto its own host.

Note that increasing scale may //decrease// availability by leaving you with
too little capacity after a failure. If you have three hosts handling traffic
and one datacenter fails, too much traffic may be sent to the single remaining
host in the surviving datacenter. You can hedge against this by mirroring new
hosts in other datacenters (for example, also separate the replica database
onto its own host).

After separating databases, separating repository + daemon nodes is likely
the next step to consider.

To improve **availability**, add another copy of everything you run in one
datacenter to a new datacenter. For example, if you have a two-node cluster,
the best way to improve availability is to run everything on a third host in a
third datacenter. If you have a 6-node cluster with a web node, a database node
and a repo + daemon node in two datacenters, add 3 more nodes to create a copy
of each node in a third datacenter.

You can continue adding hosts until you run out of hosts.


Next Steps
==========

Continue by:

  - learning how Phacility configures and operates a large, multi-tenant
    production cluster in ((cluster)).