Explain upstream attitudes toward CLI exit codes

src/docs/user/field/exit_codes.diviner

@title Command Line Exit Codes
@group fieldmanual

Explains the use of exit codes in Phabricator command line scripts.

Overview
========

When you run a command from the command line, it exits with an //exit code//.
This code is normally not shown on the CLI, but you can examine the exit code
of the last command you ran by looking at `$?` in your shell:

  $ ls
  ...
  $ echo $?
  0

Programs which run commands can operate on exit codes, and shell constructs
like `cmdx && cmdy` operate on exit codes.

The code `0` means success. Other codes signal some sort of error or status
condition, depending on the system and command.

With rare exception, Phabricator uses //all other codes// to signal
**catastrophic failure**.

This is an explicit architectural decision and one we are unlikely to deviate
from: generally, we will not accept patches which give a command a nonzero exit
code to indicate an expected state, an application status, or a minor abnormal
condition.

Generally, this decision reflects a philosophical belief that attaching
application semantics to exit codes is a relic of a simpler time, and that
they are not appropriate for communicating application state in a modern
operational environment. This document explains the reasoning behind our use of
exit codes in more detail.

In particular, this approach is informed by a focus on operating Phabricator
clusters at scale. This is not a common deployment scenario, but we consider it
the most important one. Our use of exit codes makes it easier to deploy and
operate a Phabricator cluster at larger scales. It makes it slightly harder to
deploy and operate a small cluster or single host by gluing together `bash`
scripts. We are willingly trading the small scale away for advantages at larger
scales.


Problems With Exit Codes
========================

We do not use exit codes to communicate application state because doing so
makes it harder to write correct scripts, and the primary benefit is that it
makes it easier to write incorrect ones.

This is somewhat at odds with the philosophy of "worse is better", but a modern
operations environment faces different forces than the interactive shell did
in the 1970s, particularly at scale.

We consider correctness to be very important to modern operations environments.
In particular, we manage a Phabricator cluster (Phacility) and believe that
having reliable, repeatable processes for provisioning, configuration and
deployment is critical to maintaining and scaling our operations. Our use of
exit codes makes it easier to implement processes that are correct and reliable
on top of Phabricator management scripts.

Exit codes as signals for application state are problematic because they are
ambiguous: you can't use them to distinguish between dissimilar failure states
which should prompt very different operational responses.

Exit codes primarily make writing things like `bash` scripts easier, but we
think you shouldn't be writing `bash` scripts in a modern operational
environment if you care very much about your software working.

Software environments which are powerful enough to handle errors properly are
also powerful enough to parse command output to unambiguously read and react to
complex state. Communicating application state through exit codes almost
exclusively makes it easier to handle errors in a haphazard way which is often
incorrect.


Exit Codes are Ambiguous
========================

In many cases, exit codes carry very little information and many different
conditions can produce the same exit code, including conditions which should
prompt very different responses.

The command line tool `grep` searches for text. For example, you might run
a command like this:

  $ grep zebra corpus.txt

This searches for the text `zebra` in the file `corpus.txt`. If the text is
not found, `grep` exits with a nonzero exit code (specifically, `1`).

Suppose you run `grep zebra corpus.txt` and observe a nonzero exit code. What
does that mean? These are //some// of the possible conditions which are
consistent with your observation:

  - The text `zebra` was not found in `corpus.txt`.
  - `corpus.txt` does not exist.
  - You do not have permission to read `corpus.txt`.
  - `grep` is not installed.
  - You do not have permission to run `grep`.
  - There is a bug in `grep`.
  - Your `grep` binary is corrupt.
  - `grep` was killed by a signal.

If you're running this command interactively on a single machine, it's probably
OK for all of these conditions to be conflated. You aren't going to examine the
exit code anyway (it isn't even visible to you by default), and `grep` likely
printed useful information to `stderr` if you hit one of the less common issues.

If you're running this command from operational software (like deployment,
configuration or monitoring scripts) and you care about the correctness and
repeatability of your process, we believe conflating these conditions is not
OK. The operational response to text not being present in a file should almost
always differ substantially from the response to the file not being present or
`grep` being broken.
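
To make the conflation concrete, here is a minimal Python sketch of what a
caller has to do just to separate a few of the conditions listed above. It
assumes a POSIX-style `grep` where `1` means "no match" and larger codes mean
an error, which is not guaranteed on every system. The function name
`classify_grep` and the labels it returns are hypothetical and exist only for
this illustration.

```python
import subprocess


def classify_grep(pattern, path):
    """Run grep and return a label for what happened (labels are illustrative)."""
    try:
        result = subprocess.run(
            ["grep", "-e", pattern, "--", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # The grep binary could not be found on PATH.
        return "grep-not-installed"
    except PermissionError:
        # The grep binary exists but we are not allowed to execute it.
        return "grep-not-executable"

    code = result.returncode
    if code == 0:
        return "found"
    if code == 1:
        # Assumption: POSIX semantics, where 1 means "no lines selected".
        return "not-found"
    if code < 0:
        # On POSIX systems, subprocess reports death by signal N as -N.
        return "killed-by-signal"
    # Everything else (missing file, unreadable file, ...) is still lumped together.
    return "error"


if __name__ == "__main__":
    print(classify_grep("zebra", "corpus.txt"))
```

Even with this much care, a missing file and an unreadable file still collapse
into the same label, and a buggy or corrupt `grep` could return any of these
codes.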

In a particularly bad case, a broken `grep` might cause a careless deployment
script to continue down an inappropriate path and cascade into a more serious
failure.

Even in a less severe case, unexpected conditions should be detected and raised
to operations staff. `grep` being broken or a file that is expected to exist
not existing are both detectable, unexpected, and likely severe conditions, but
they cannot be differentiated and handled by examining the exit code of
`grep`. It is much better to detect and raise these problems immediately than
discover them after a lengthy root cause analysis.

Some of these conditions can be differentiated by examining the specific exit
code of the command instead of acting on all nonzero exit codes. However, many
failure conditions produce the same exit codes (particularly code `1`) and
there is no way to guarantee that a particular code signals a particular
condition, especially across systems.

Realistically, it is also relatively rare for scripts to even make an effort to
distinguish between exit codes, and all nonzero exit codes are often treated
the same way.


Bash Scripts are not Robust
===========================

Exit codes that indicate application status make writing `bash` scripts (or
scripts in other tools which provide a thin layer on top of what is essentially
`bash`) a lot easier and more convenient.

For example, it is pretty tricky to parse JSON in `bash` or with standard
command-line tools, and much easier to react to exit codes. This is sometimes
used as an argument for communicating application status in exit codes.

We reject this because we don't think you should be writing `bash` scripts if
you're doing real operations. Fundamentally, `bash` shell scripts are not a
robust building block for creating correct, reliable operational processes.

Here is one problem with using `bash` scripts to perform operational tasks.
Consider this command:

  $ mysqldump | gzip > backup.sql.gz

Now, consider this command:

  $ mysqldermp | gzip > backup.sql.gz

These commands represent a fairly standard way to accomplish a task (dumping
a compressed database backup to disk) in a `bash` script.

Note that the second command contains a typo (`dermp` instead of `dump`), so
the shell cannot find the program and that stage of the pipeline fails with a
nonzero exit code.

However, both of these statements exit with exit code `0` (indicating success),
because a pipeline reports the exit status of its last command, which here is
`gzip`. Both will create a `backup.sql.gz` file. One backs up your data; the
other never backs up your data. This second command will never work and never
do what the author intended, but will appear successful under casual
inspection.

These behaviors are the same under `set -e`.

This fragile attitude toward error handling is endemic to `bash` scripts. The
default behavior is to continue on errors, and it isn't easy to change this
default. Options like `set -e` are unreliable and it is difficult to detect and
react to errors in fundamental constructs like pipes. The tools that `bash`
scripts employ (like `grep`) emit ambiguous error codes. Scripts cannot help
but propagate this ambiguity no matter how careful they are with error handling.

It is likely //possible// to implement these things safely and correctly in
`bash`, but it is not easy or straightforward. More importantly, it is not the
default: the default behavior of `bash` is to ignore errors and continue.
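
The pipeline problem above can be reproduced from a more capable environment.
The following Python sketch first runs a misspelled command piped into `gzip`
through the system shell and observes the exit code, then shows one possible
way to run the two stages so that a failure in either is noticed. The helper
names `naive_pipeline_status` and `checked_dump` are hypothetical, and the
sketch assumes `gzip` is available on the host.

```python
import subprocess


def naive_pipeline_status(first_command):
    """Run `<first_command> | gzip` through the shell, as a script would."""
    result = subprocess.run(
        f"{first_command} | gzip -c > /dev/null",
        shell=True,
        stderr=subprocess.DEVNULL,
    )
    # The shell reports only the exit status of the last stage (gzip).
    return result.returncode


def checked_dump(dump_argv, out_path):
    """Compress the output of dump_argv into out_path, checking both stages."""
    with open(out_path, "wb") as out:
        # Popen raises FileNotFoundError if the dump program does not exist.
        dump = subprocess.Popen(dump_argv, stdout=subprocess.PIPE)
        gz = subprocess.Popen(["gzip", "-c"], stdin=dump.stdout, stdout=out)
        # Close our copy of the pipe so the dump sees SIGPIPE if gzip exits early.
        dump.stdout.close()
        gz_code = gz.wait()
        dump_code = dump.wait()
    if dump_code != 0 or gz_code != 0:
        raise RuntimeError(f"backup failed: dump={dump_code}, gzip={gz_code}")


if __name__ == "__main__":
    # A deliberately nonexistent command name, standing in for the typo above.
    print(naive_pipeline_status("definitely-not-a-real-command-xyz"))
```

Unlike the shell pipeline, `checked_dump` fails loudly when the first stage is
missing or exits nonzero, at the cost of a few more lines of code.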

Gluing commands together in `bash` or something that sits on top of `bash`
makes it easy and convenient to get a process that works fairly well most of
the time at small scales, but we are not satisfied that it represents a robust
foundation for operations at larger scales.


Reacting to State
=================

Instead of communicating application state through exit codes, we generally
communicate application state through machine-parseable output with a success
(`0`) exit code. All nonzero exit codes indicate catastrophic failure which
requires operational intervention.

Callers are expected to request machine-parseable output if necessary (for
example, by passing a `--json` flag or other similar flags), verify the command
exits with a `0` exit code, parse the output, then react to the state it
communicates as appropriate.

In a sufficiently powerful scripting environment (e.g., one with data
structures and a JSON parser), this is straightforward and makes it easy to
react precisely and correctly. It also allows scripts to communicate
arbitrarily complex state. Provided your environment gives you an appropriate
toolset, it is much more powerful and not significantly more complex than using
error codes.

Most importantly, it allows the calling environment to treat nonzero exit
statuses as catastrophic failure by default.


Moving Forward
==============

Given these concerns, we are generally unwilling to bring changes which use
exit codes to communicate application state (other than catastrophic failure)
into the upstream. There are some exceptions, but these are rare. In
particular, ease of use in a `bash` environment is not a compelling motivation.
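
As a rough illustration of the calling pattern described under "Reacting to
State", the following Python sketch runs a command, treats any nonzero exit
code as catastrophic, and only then parses the output and reacts to it. The
command here is a stand-in (a Python one-liner that prints JSON). The `status`
key, the `CatastrophicFailure` class and the `read_state` helper are
hypothetical names for this example, not part of any Phabricator script.

```python
import json
import subprocess
import sys


class CatastrophicFailure(Exception):
    """Raised when a command exits nonzero or emits unparseable output."""


def read_state(argv):
    """Run argv, require exit code 0, and return its parsed JSON output."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        raise CatastrophicFailure(
            f"{argv[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError as error:
        raise CatastrophicFailure(f"unparseable output: {error}") from error


# Stand-in for a real command run in a machine-parseable output mode.
STAND_IN = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'status': 'not-found'}))",
]

if __name__ == "__main__":
    state = read_state(STAND_IN)
    # Application state is read from the output, never from the exit code.
    if state["status"] == "found":
        print("text is present")
    else:
        print("text is absent; nothing broke")
```

With this shape, "the text was not found" is ordinary data the caller branches
on, while a missing binary, a crash, or garbage output all surface as
exceptions that stop the process by default.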

We are broadly willing to make output machine parseable or provide an explicit
machine output mode (often a `--json` flag) if there is a reasonable use case
for it. However, we operate a large production cluster of Phabricator instances
with the tools available in the upstream, so the lack of machine parseable
output is not sufficient to motivate adding such output on its own: we also
need to understand the problem you're facing, and why it isn't a problem we
face. A simpler or cleaner approach to the problem may already exist.

If you just want to write `bash` scripts on top of Phabricator scripts and you
are unswayed by these concerns, you can often just build a composite command to
get roughly the same effect that you'd get out of an exit code.

For example, you can pipe things to `grep` to convert output into exit codes.
This should generally have failure rates that are comparable to the background
failure level of relying on `bash` as a scripting environment.
