Strategies for finding binary dependencies

update README

+17 -245
README.md
@@ -5,258 +5,30 @@
 
 # bindep — Strategies for finding binary dependencies
 
-_Vlad-Stefan Harbuz ([vlad.website][vlad]), Sep 2025_<br>
-
-Trying to make Open Source more [sustainable][sustainability], for example as part of initiatives like the [Open Source
-Endowment][endowment] and [thanks.dev][td], requires information about what dependencies are in a certain project's
-dependecy tree. For example, [React][react] depends on [eslint][eslint], and we know this because Javascript projects
-usually use manifest files that list dependencies and where to find them. In React's case, that's a
-[packages.json][react-manifest] file, like with most Javascript projects. There are other such manifests for various
-ecosystems — `requirements.txt` and `pyproject.toml` for Python, `go.mod` for Go, `Cargo.toml` for Rust and so on.
-
-These kinds of dependencies are _source dependencies_ — each of these manifest files point to where a dependency's
-source code can be obtained, and this source code is then downloaded and compiled or interpreted along with the main
-project's code.
-
-But there's also a different kind of dependency: _binary dependencies_. Instead of including dependencies' _source code_
-as part of compilation/interpretation, some projects expect to be able to find _compiled binary forms_ of each of their
-dependencies. In order to make use of these dependencies, a project must know where each dependency's compiled binary is
-on the system, which symbols within that binary it would like to use (~function names etc), as well as the [ABI][abi] in
-use, which are all given to a [linker][linker] or [FFI][ffi] mechanism (like [cffi][cffi]) that correctly wires up eg
-calls to functions located within dependencies.
-
-Using binary dependencies is common in languages such as C and C++. But, when it comes to reconstructing dependency
-trees, this is a problem, because projects that use binary dependencies typically do not have a manifest file. This
-makes binary dependencies very difficult to identify.
-
-But there is still often other information we can use to reconstruct dependency trees. For example, projects that use
-binary dependencies also often use some kind of build system, and each build system has its own build recipe file —
-[CMake][cmake] uses [`CMakeLists.txt`][cmake-file], [Meson][meson] uses [`meson.build`][meson-file] and so on. And
-information about dependencies can also sometimes be gleaned from files that describe infrastructure, such as
-[Docker][docker]'s [`Dockerfile`][docker-file].
-
-A further complication is that dependency trees sometimes span different ecosystems. [pandas][pandas] depends on
-[numpy][numpy], and both are Python projects. But numpy depends on a variety of libraries that implement [Basic Linear
-Algebra Subprograms][blas], and those libraries are written not in Python, but C or C++. So to fully work out pandas's
-dependency tree, we need to identify _binary dependencies_ from _different ecosystems_ than Python.
-
-Another thing to take into account is that some dependencies are optional. numpy can use one of various BLAS libraries,
-like [OpenBLAS][openblas], [flexiblas][flexiblas], [LAPACK][lapack] or [Intel MKL][mkl]. Not all of these dependencies
-are “hard” dependencies, because only one of these BLAS implementations is needed. A well-constructed dependency tree
-should incorporate this information.
-
-And it is desirable to have a solution that can construct dependency trees for a wide range of arbitrary
-never-before-seen repositories; autonomously, so without manual intervention; and at large scales, covering as many
-projects as possible. These are prerequisites for solutions than might be used to create a model of dependencies across
-the global Open Source ecosystem, sampling as many projects as possible.
-
-There are various strategies that might meet the above requirements. This document details various possible strategies
-for getting binary dependency information for use in a dependency tree, along with their pros and cons.
-
-Some strategies are marked as _infeasible_, meaning that they have limitations that prevent them from being used
-as a general solution, but discussing them is still interesting and informative.
-
-## Collaboration
-
-Solving this problem should be a collective effort, so feel free to contribute your thoughts by
-[opening an issue][issues],
-[submitting a pull request][prs],
-or emailing me at [vlad@vlad.website](mailto:vlad@vlad.website).
-
-Consider checking out the projects that are trying to solve Open Source sustainability problems:
-
-* [Open Source Endowment][endowment]
-* [Open Source Pledge][pledge]
-* [thanks.dev][td]
-
-## Table of contents
-
-Strategies — Feasible:
-
-* [Statically analyse build recipes](#statically-analyse-build-recipes)
-* [Infer dependencies from symbols extracted from binaries](#infer-dependencies-from-symbols-extracted-from-binaries)
-
-Strategies — Infeasible:
-
-* [Patch build tools](#patch-build-tools)
-* [Use infrastructure recipes](#use-infrastructure-recipes)
-* [Create a new standard](#create-a-new-standard)
-
-## Strategies — Feasible
-
-This section contains the strategies I have identified that might meet the above requirements.
-
-### Statically analyse build recipes
-
-Build tool recipes generally have some way to specify a dependency, and these specifications are then read by the build
-tool itself. For example, in Meson, dependencies are [specified][meson-deps] by writing something like
-`dependency('zlib', version : '>=1.2.8')`.
-
-One might think a trivial static analysis, such as simply grepping for `dependency('\([a-zA-Z0-9-_]+\)'`, would get us
-the dependencies, but consider the following excerpt from [numpy's `meson.build`][meson-file]:
-
-```
-foreach _name : blas_order
-  if _name == 'mkl'
-    blas = dependency('mkl',
-      modules: ['cblas'] + blas_interface + mkl_opts,
-      required: false, # may be required, but we need to emit a custom error message
-      version: mkl_version_req,
-    )
-    if not blas.found() and mkl_may_use_sdl
-      blas = dependency('mkl', modules: ['cblas', 'sdl: true'], required: false)
-    endif
-  else
-    if _name == 'flexiblas' and use_ilp64
-      _name = 'flexiblas64'
-    endif
-    blas = dependency(_name, modules: ['cblas'] + blas_interface, required: false)
-  endif
-  if blas.found()
-    break
-  endif
-endforeach
-```
-
-Clearly, the syntax of build recipe files is complex enough to require actual parsing and evaluation.
-
-Such a static analysis of build recipes is possible, though. Lightweight interpreters for build recipes already exist,
-such as [parse.c][muon-parser] from [muon][muon], which is a lightweight Meson implementation. In fact, meson itself
-provides an [IntrospectionInterpreter][meson-introspection] capable of identifying dependencies. Such interpreters could
-be used to turn build recipes into [AST][ast]s, which can then be evaluated using custom rules that do nothing but
-collect the names of all dependencies referred to in the build recipe.
-
-Of course, such a parser-evaluator would have to be written for each build system, but once the most popular build
-systems, such as CMake and Meson, are covered, it seems likely that the dependencies of a good proportion of C/C++
-projects could be reconstructed.
-
-There are still caveats and limitations. Most likely, this approach would only yield the _names_ of dependencies, and
-not necessarily the URLs to their repositories, so we would have to build an index containing, for each name, the most
-likely repository or repositories to be associated with that name.
-
-✨ **Implementation:** I've started an implementation of this approach in the [meson](./meson) directory.
-
-### Infer dependencies from symbols extracted from binaries
+A codebase might depend on another project's source code; or it might depend on another project's compiled binaries.
+Source code dependency relationships are mostly easy to identify; binary dependency relationships are not. We need to
+identify binary dependency relationships to ensure the Open Source ecosystem is secure and sustainably funded.
 
-Code that has binary dependencies calls into these dependencies using specific symbols. For example, numpy might
-look at the compiled dynamic library `libscipy_openblas64_-8fb3d286.so` for the symbol
-`scipy_openblas_set_num_threads64_`.
+This project aims to provide tools that enable us to identify binary dependency relationships.
 
-How can we identify that numpy depends on `openblas64`? Searching for the `.so` dynamic library file is not reliable,
-not only because its filename is not predictable, but also because the calling code does not need to call into a
-dynamically linked `openblas64.so` file — the `openblas64` code could even be statically compiled into the same binary
-as the calling code. But the symbols that a library is made up of, such as `scipy_openblas_set_num_threads64_`, _would_
-probably collectively correctly identify the library.
+Detailed proposal
+: [Bindep, a Binary Dependency Discovery System][proposal]
 
-This strategy is very universally applicable. It would, however, require building some kind of index mapping symbols to
-the libraries they belong to.
-
-✨ **Implementation:** For a lot more detail on this strategy, see [ecosyste-ms/packages#1261][eco-1261]
-
-## Strategies — Infeasible
-
-This section contains strategies that I think are interesting and informative, but will not meet our needs on their own.
+See the 2026 FOSDEM talk
+: [Binary Dependencies: Identifying the Hidden Packages We All Depend On][fosdem-talk]
 
-### Patch build tools
-
-Instead of going through the trouble of writing code to statically analyse build recipes ([see
-above](#statically-analyse-build-recipes)), one could make use of a build recipe parser that already exists — the build
-system itself. One could patch the build system so that, whenever a dependency specification is encountered, that
-dependency is printed in some convenient way, in addition to the normal build process.
+See also
+: [Connecting the dots between system package managers and language package managers][packages1261]
 
-In fact, this may not even require patching. CMake will print a list of encountered dependencies when `CMakeLists.txt`
-specifies `set_property(GLOBAL PROPERTY GLOBAL_DEPENDS_DEBUG_MODE 1)`. And CMake can even print out an illustration
-containing a graph of dependencies, when called using `cmake --graphviz=graph.dot ...`. This is hopeful, since CMake is
-probably the most widely used C/C++ build system.
+## Usage
 
-But this strategy is infeasible because it requires _actually building_ the project we're trying to get a dependency
-tree for. In addition to being computationally intensive and having unknown side effects, most projects simply cannot be
-autonomously built, because they require manual intervention such as config files being written, packages being manually
-installed, and other prerequisites. So although this approach is interesting and informative, it is not sufficient.
+This repository will contain some programs. They are currently being written. Check back!
 
-### Use infrastructure recipes
+## Authorship
 
-Infrastructure recipes such as `Dockerfile`s specify the dependencies that must be installed for a project to work,
-including binary dependencies. However, these dependencies can be specified in many different ways. Consider this
-excerpt from [linkding][linkding]'s [`Dockerfile`][docker-file]:
+Vlad-Stefan Harbuz ([vlad.website][vlad]) unless otherwise noted.
 
-```
-RUN apt-get update && apt-get -y install build-essential pkg-config libpq-dev libicu-dev libsqlite3-dev wget unzip libffi-dev libssl-dev curl
-...
-# install uv, use installer script for now as distroless images are not availabe for armv7
-ADD https://astral.sh/uv/0.8.13/install.sh /uv-installer.sh
-...
-COPY pyproject.toml uv.lock ./
-RUN /root/.local/bin/uv sync --no-dev --group postgres
-...
-ARG SQLITE_RELEASE_YEAR=2023
-ARG SQLITE_RELEASE=3430000
-...
-RUN wget https://www.sqlite.org/${SQLITE_RELEASE_YEAR}/sqlite-amalgamation-${SQLITE_RELEASE}.zip && \
-    unzip sqlite-amalgamation-${SQLITE_RELEASE}.zip && \
-    cp sqlite-amalgamation-${SQLITE_RELEASE}/sqlite3.h ./sqlite3.h && \
-    cp sqlite-amalgamation-${SQLITE_RELEASE}/sqlite3ext.h ./sqlite3ext.h && \
-    wget https://www.sqlite.org/src/raw/ext/icu/icu.c?name=91c021c7e3e8bbba286960810fa303295c622e323567b2e6def4ce58e4466e60 -O icu.c && \
-    gcc -fPIC -shared icu.c `pkg-config --libs --cflags icu-uc icu-io` -o libicu.so
-...
-RUN apt-get update && apt-get -y install media-types libpq-dev libicu-dev libssl3t64 curl
-```
-
-This excerpt contains information about many dependencies such as `libpq` and `libssl`. But parsing this recipe is
-problematic in many ways:
-
-* It is not straightforward to parse package manager commands such as those for `apt`, especially when many different
-  package managers are used across distributions
-* The same project can be packaged under many different names in many different package managers — although this could
-  be solved by building an index of such names and using heuristics
-* It is not at all straightforward to parse non-package-manager installation steps such as `wget`, `unzip` etc
-
-And in any case, not all projects will have a `Dockerfile`. So the strategy of using infrastructure recipes has serious
-limitations.
-
-### Create a new standard
-
-Instead of attempting to glean information from sources that were not made to be parsed in this way, such as
-`CMakeLists.txt`, it might be best to _create a specification_ for a new manifest file format to be used in projects
-that make use of binary dependencies. Such a file format would allow developers to specify binary dependencies in a
-generally machine-readable format, which would make such dependencies easier to parse, in a way that is understood by
-everyone. Ideally, such a specification would somehow easily interoperate with existing build tools if this is required
-or useful.
-
-While this is probably a good idea, it would require widespread adoption, which is not feasible in the short-term, so
-this strategy would not help us meet our Open Source sustainability goals anytime soon.
-
-[abi]: https://en.wikipedia.org/wiki/Application_binary_interface
-[ast]: https://en.wikipedia.org/wiki/Abstract_syntax_tree
-[blas]: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
-[cffi]: https://cffi.readthedocs.io/en/stable/
-[cmake-file]: https://github.com/ClickHouse/ClickHouse/blob/master/CMakeLists.txt
-[cmake]: https://cmake.org/cmake/help/latest/manual/cmake.1.html
-[docker-file]: https://github.com/sissbruecker/linkding/blob/master/docker/default.Dockerfile
-[docker]: https://www.docker.com/
-[eco-1261]: https://github.com/ecosyste-ms/packages/issues/1261
-[endowment]: https://endowment.dev
-[eslint]: https://eslint.org/
-[ffi]: https://en.wikipedia.org/wiki/Foreign_function_interface
-[flexiblas]: https://github.com/mpimd-csc/flexiblas
-[issues]: https://codeberg.org/vladh/bindep/issues
-[lapack]: https://www.netlib.org/lapack/
-[linkding]: https://github.com/sissbruecker/linkding
-[linker]: https://en.wikipedia.org/wiki/Linker_(computing)
-[meson-deps]: https://mesonbuild.com/Dependencies.html
-[meson-file]: https://github.com/numpy/numpy/blob/main/numpy/meson.build
-[meson-introspection]: https://github.com/mesonbuild/meson/blob/master/mesonbuild/ast/introspection.py
-[meson]: https://mesonbuild.com/
-[mkl]: https://docs.cirrus.ac.uk/software-libraries/intel_mkl/
-[muon-parser]: https://git.sr.ht/~lattis/muon/tree/master/item/src/lang/parser.c
-[muon]: https://git.sr.ht/~lattis/muon
-[numpy]: https://github.com/numpy/numpy
-[openblas]: https://github.com/OpenMathLib/OpenBLAS
-[pandas]: https://github.com/pandas-dev/pandas
-[pledge]: https://opensourcepledge.com
-[prs]: https://codeberg.org/vladh/bindep/pulls
-[react-manifest]: https://github.com/facebook/react/blob/main/package.json
-[react]: https://github.com/facebook/react
-[sustainability]: https://openpath.quest/2024/the-open-source-sustainability-crisis/
-[td]: https://thanks.dev
+[fosdem-talk]: https://fosdem.org/2026/schedule/event/7NQJNU-binary_dependencies_identifying_the_hidden_packages_we_all_depend_on/
+[packages1261]: https://github.com/ecosyste-ms/packages/issues/1261
+[proposal]: https://hackmd.io/@vladh/binary-dependencies
 [vlad]: https://vlad.website
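
The removed section on statically analysing build recipes argues that grepping for literal `dependency('...')` calls is not enough. The following is a minimal Python sketch of that point, under stated assumptions: the recipe string is a shortened, illustrative version of the numpy excerpt quoted in the diff (not the real file), the regex is the README's pattern adapted to Python syntax, and `NAIVE`, `literal_dependencies` and `count_dependency_calls` are hypothetical helper names, not part of bindep or Meson.

```python
import re

# The README's grep pattern, adapted to Python regex syntax (assumption:
# only single-quoted string literals are considered).
NAIVE = re.compile(r"dependency\('([a-zA-Z0-9_-]+)'")

# Shortened, illustrative recipe modelled on the quoted numpy excerpt.
RECIPE = """\
foreach _name : blas_order
  if _name == 'mkl'
    blas = dependency('mkl', modules: ['cblas'], required: false)
  else
    blas = dependency(_name, modules: ['cblas'], required: false)
  endif
endforeach
"""


def literal_dependencies(recipe: str) -> set[str]:
    """Return dependency names that appear as string literals."""
    return set(NAIVE.findall(recipe))


def count_dependency_calls(recipe: str) -> int:
    """Count every dependency(...) call, literal argument or not."""
    return len(re.findall(r"\bdependency\(", recipe))


print(literal_dependencies(RECIPE))    # {'mkl'}
print(count_dependency_calls(RECIPE))  # 2
```

The regex reports only `mkl`, although the recipe contains two `dependency(...)` calls. The second call takes its name from the loop variable `_name`, which can only be resolved by parsing and evaluating the recipe, as the removed text concludes.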