🧱 Chunk is a download manager for slow and unstable servers
Merge pull request #33 from cuducos/clean-up-readme

Cleans up `README.md` for 1st release

Authored by Daniel Fireman, committed by GitHub
810efdc6 69afbce8

+38 -74
README.md
[![Tests](https://github.com/cuducos/chunk/actions/workflows/tests.yaml/badge.svg)](https://github.com/cuducos/chunk/actions/workflows/tests.yaml)
[![Format](https://github.com/cuducos/chunk/actions/workflows/gofmt.yaml/badge.svg)](https://github.com/cuducos/chunk/actions/workflows/gofmt.yaml)
[![Lint](https://github.com/cuducos/chunk/actions/workflows/golint.yaml/badge.svg)](https://github.com/cuducos/chunk/actions/workflows/golint.yaml)
[![GoDoc](https://godoc.org/github.com/cuducos/chunk?status.svg)](https://godoc.org/github.com/cuducos/chunk)

Chunk is a download tool for slow and unstable servers.

## Usage
### CLI

Install it with `go install github.com/cuducos/chunk`, then:

```console
$ chunk <URLs>
```

Use `--help` for detailed instructions.

### API

The [`Download`](https://pkg.go.dev/github.com/cuducos/chunk#Download) method returns a channel of [`DownloadStatus`](https://pkg.go.dev/github.com/cuducos/chunk#DownloadStatus) statuses. The channel is closed once all downloads are finished, but the user is in charge of handling errors.
#### Simplest use case

```go
d := chunk.DefaultDownloader()
ch := d.Download(urls)
```

#### Customizing some options

```go
d := chunk.DefaultDownloader()
d.MaxRetries = 42
ch := d.Download(urls)
```

#### Customizing everything

```go
d := chunk.Downloader{...}
ch := d.Download(urls)
```

## How?

It uses HTTP range requests, retries each chunk (not the whole file), skips ranges that were already downloaded, and pauses to give struggling servers time to recover.
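The chunking itself is simple arithmetic: given the file's total size and a maximum chunk size, the file is split into contiguous byte ranges. A minimal sketch, not the package's actual implementation:

```go
package main

import "fmt"

// byteRange is an inclusive [Start, End] pair, matching the semantics of the
// HTTP Range header ("bytes=Start-End").
type byteRange struct{ Start, End uint64 }

// splitIntoChunks returns the ranges covering a file of totalSize bytes,
// each at most chunkSize bytes long.
func splitIntoChunks(totalSize, chunkSize uint64) []byteRange {
	var ranges []byteRange
	for start := uint64(0); start < totalSize; start += chunkSize {
		end := start + chunkSize - 1
		if end >= totalSize {
			end = totalSize - 1
		}
		ranges = append(ranges, byteRange{start, end})
	}
	return ranges
}

func main() {
	// A 10-byte file in 4-byte chunks splits into [0,3], [4,7] and [8,9].
	fmt.Println(splitIntoChunks(10, 4))
}
```

Each range then becomes one request, which is what makes per-chunk retries and per-chunk bookkeeping possible.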
### Download using HTTP range requests

To complete downloads from slow and unstable servers, the download is done in “chunks” using [HTTP range requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests). This avoids long-standing HTTP connections and makes it predictable how long is too long to wait for a response.

### Retries by chunk, not by file

To be quicker and avoid rework, the primary way to handle failure is to retry that chunk (content range), not the whole file.

### Control of which chunks are already downloaded

To avoid restarting from scratch after an unhandled error, `chunk` keeps track of which ranges of each file were already downloaded; when restarted, it only downloads what is still needed to complete the downloads.

### Detect server failures and give it a break

To avoid unnecessary stress on the server, `chunk` relies not only on HTTP responses but also on other signs that the connection is stale; when it detects a failure, it recovers and gives the server some time to recover from the stress.
## Why?

The idea for the project emerged when [Minha Receita](https://github.com/cuducos/minha-receita) struggled to download [37 files that add up to approximately 5 GB](https://www.gov.br/receitafederal/pt-br/assuntos/orientacao-tributaria/cadastros/consultas/dados-publicos-cnpj). Most existing download solutions (e.g. [`got`](https://github.com/melbahja/got)) seem built for downloading large files, not for downloading from slow and unstable servers, which is the case at hand.