🧱 Chunk is a download manager for slow and unstable servers

First commit, authored by Eduardo Cuducos and committed by GitHub (ced2ebb0).

README.md
# Chunk

`chunk` is a sort of download manager written in pure Go. The idea for the project emerged when it proved difficult for [Minha Receita](https://github.com/cuducos/minha-receita) to handle the download of [37 files that add up to approximately 5 GB](https://www.gov.br/receitafederal/pt-br/assuntos/orientacao-tributaria/cadastros/consultas/dados-publicos-cnpj). Most of the download solutions out there (e.g. [`got`](https://github.com/melbahja/got)) seem to be designed for downloading large files, not for downloading from slow and unstable servers, which is the case at hand.

## Main features

### Download using HTTP range requests

To complete downloads from slow and unstable servers, the download is done in “chunks” using [HTTP range requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests). This avoids relying on long-standing HTTP connections and makes it predictable how long is too long to wait for a response.

### Retries by chunk, not by file

To be quicker and avoid rework, the primary way to handle failure is to retry that “chunk” (that byte range), not the whole file.

### Control of which chunks are already downloaded

To avoid restarting from the beginning in case of unhandled errors, `chunk` knows which ranges of each file were already downloaded; when restarted, it only downloads what is actually needed to complete the downloads.

### Detect server failures and give it a break

To avoid unnecessary stress on the server, `chunk` relies not only on HTTP responses but also on other signs that the connection is stale, and can:

1. recover from that, and
2. give the server some time to recover from stress.
## Tech design

### Input

* List of URLs
* Directory where to save the files
* Configuration (these can have defaults and be optional; customizing them can be a stretch goal):
  * Chunk download attempt timeout
  * Maximum parallel connections to each server
  * Maximum retries per chunk (must have an option for unlimited retries)
  * Maximum range size (chunk size)
  * Time to wait on server failure

### Prepare downloads

For each URL in the list (this can be done in parallel):

* Make sure the server accepts HTTP range requests (stretch goal)
  * It can fail if the server doesn't
  * Or it can fall back to a regular HTTP request to download the whole file
* Find out the total file size
* Determine all the chunks to be downloaded (the start and end bytes of each)
* Read or create a temporary control of downloaded and pending chunks
* Enqueue all the pending chunks

With all this information, show a progress bar with the total work remaining.

### Download

* Set a timeout
* Start the HTTP range request
* In case of failure or timeout, re-queue the chunk
* In case of success, send the chunk contents to a `results` channel

### Writing files

* Read the bytes from the `results` channel
* Write them to the file on disk
* Update a progress bar to give the user an idea of the status of the downloads
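The pipeline described above (splitting a file into chunks, download workers with timeouts that re-queue failed chunks, and a writer consuming a `results` channel) could be sketched like this. All type and function names here are hypothetical, not `chunk`'s actual API:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// chunkRange describes one byte range (inclusive) of a file.
type chunkRange struct{ start, end int64 }

// splitChunks covers the "determine all the chunks" step: it breaks a
// file of totalSize bytes into ranges of at most chunkSize bytes each.
func splitChunks(totalSize, chunkSize int64) []chunkRange {
	var chunks []chunkRange
	for start := int64(0); start < totalSize; start += chunkSize {
		end := start + chunkSize - 1
		if end >= totalSize {
			end = totalSize - 1
		}
		chunks = append(chunks, chunkRange{start, end})
	}
	return chunks
}

// downloadedChunk carries a chunk's bytes to the writer.
type downloadedChunk struct {
	chunkRange
	data []byte
}

// worker consumes pending chunks, downloads each with a timeout, and
// either re-queues the chunk on failure or sends its bytes to results.
// (A real implementation would also cap retries and back off on
// repeated server failures.)
func worker(url string, pending chan chunkRange, results chan<- downloadedChunk, timeout time.Duration) {
	for c := range pending {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		req, _ := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", c.start, c.end))
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			cancel()
			pending <- c // failure or timeout: re-queue this chunk
			continue
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		cancel()
		if err != nil || resp.StatusCode != http.StatusPartialContent {
			pending <- c
			continue
		}
		results <- downloadedChunk{c, body}
	}
}

// writer reads from the results channel and writes each chunk at its
// own offset, so chunks may arrive in any order.
func writer(f *os.File, results <-chan downloadedChunk) error {
	for c := range results {
		if _, err := f.WriteAt(c.data, c.start); err != nil {
			return err
		}
	}
	return nil
}
```

Writing with `WriteAt` means no chunk ordering is needed on the writer side, which keeps the download workers fully independent.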