📝 Move IE note · davidgasquez.com/handbook@4467545

+56 -51

1 changed file

Impact Evaluators.md

···
1 1 # Impact Evaluators
2 2
3 + Impact Evaluators are frameworks for [[Coordination|coordinating]] work and aligning [[Incentives]] in complex [[Systems]]. They provide mechanisms for retrospectively evaluating and rewarding contributions based on actual impact, helping solve coordination problems in [[Public Goods Funding]], research evaluation, and decentralized systems.
4 +
3 5 It's hard to do [[Public Goods Funding]], open-source software, research, etc. that don't have a clear, immediate financial return, especially high-risk/high-reward projects.
4 6
5 7 Traditional funding often fails here. Instead of just giving money upfront (prospectively), Impact Evaluators create systems that look back at what work was actually done and what impact it actually had (retrospectively). It's much easier to judge the impact in a retrospective way!
···
20 22 - Focus on positive sum games and mechanisms.
21 23 - E.g: OSO's "developer count" requires +5 commits to be counted. You might or might not align with that metric.
22 24 - IEs, as most systems should have a deadline or something like that so it fades away if it's not working.
25 + - Fix rules to keep things simple and easy to play. Opinionated framework with sane defaults!
23 26 - [IEs are the scientific method in disguise like AI evals](https://eugeneyan.com/writing/eval-process/). You need automated IEs, which is basically science applied to building better systems. You also need human oversight.
24 - - For optimization tasks with continuous output, follow bittensor model.
27 + - For areas with continuous output (e.g: minting for "better path finding algorightms"), follow Bittensor model.
25 28 - IEs are like nuclear power: extremely powerful if used correctly, but so very easy to get wrong, and when things go wrong the whole thing blows up in your face.
29 + - **Start local and iterate**. Begin with small communities defining their own [[Metrics]] and evaluation criteria. Use rapid [[Feedback Loops]] to learn what works before scaling up.
30 + - Each community understands its context better than outsiders ([seeing like a state blinds you to local realities](https://slatestarcodex.com/2017/03/16/book-review-seeing-like-a-state/))
31 + - Local experiments surface patterns for higher-level systems
32 + - Small groups enable iterated games that reward trust and penalize defection
33 + - Reduced size reduce friction
34 + - **Build anti-Goodhart resilience**. Any metric used for decisions [becomes subject to gaming pressures](https://en.wikipedia.org/wiki/Campbell%27s_law). Design for evolution:
35 + - Run multiple evaluation algorithms in parallel and let humans choose
36 + - Use exploration/exploitation trade-offs (like multi-armed bandits) to test new metrics
37 + - Make the meta-layer for evaluating evaluators explicit
38 + - **Separate data from judgment**. [Impact Evaluators work like data-driven organizations](https://handbook.davidgasquez.com/data/data-culture):
39 + - Gather objective attestations about work (commits, usage stats, dependencies)
40 + - Apply multiple "evaluation lenses" to interpret the data
41 + - Let funders choose which lenses align with their values
42 + - **Design for composability**. Define clear data structures (graphs, weight vectors) as APIs between layers. This enables:
43 + - Multiple communities to share measurement infrastructure
44 + - Different evaluation methods to operate on the same data
45 + - Evolution through recombination rather than redesign
26 46 - We might be in an "Arrow's Impossibility Theorem" situation where there is no way to design a mechanism that is fair, efficient and incentive compatible.
27 47 - There is no "end of history" because whenever you fix an evaluation, some group has an incentive to abuse or break it again and feast on the wreckage.
28 48 - This is the formal impossibility theorem that no mechanism can simultaneously achieve four desirable criteria:
···
47 67 - Voting on models: feels too abstract for voters and doesn't leverage their specific project expertise
48 68 - Voting on metrics: judges just play with numbers until they get their favored allocation
49 69 - Voting directly on projects: halo effect, peanut butter distributions, heavy operational workload
70 + - **Incomplete contracts problem**. [It's expensive to measure what really matters](https://meaningalignment.substack.com/p/market-intermediaries-a-post-agi), so we optimize proxies that drift from true goals.
71 + - Current markets optimize clicks and engagement over human flourishing
72 + - The more powerful the optimization, the more dangerous the misalignment
73 + - **Information elicitation without verification**. Getting truthful data from subjective evaluation when you can't verify it requires clever [[Mechanism Design]]:
74 + - [Peer prediction mechanisms](https://jonathanwarden.com/information-elicitation-mechanisms/) that reward agreement with hidden samples
75 + - [Bayesian Truth Serum](https://www.science.org/doi/10.1126/science.1102081) that uses both answers and predictions
76 + - Coordination games where truth serves as a Schelling point
77 + - **Collusion resistance**. Any mechanism helping under-coordinated parties will also help [over-coordinated parties extract value](https://vitalik.eth.limo/general/2019/04/03/collusion.html). Countermeasures include:
78 + - Identity-free incentives (like proof-of-work)
79 + - Fork-and-exit rights for minorities
80 + - Privacy pools that exclude provably malicious actors
81 + - Multiple independent "dashboard organizations" preventing capture
82 + - They should be flexible as it's hard to predict ways the evaluation metrics will be gamed.
50 83 - An allocation mechanism can be seen as a measurement process, with the goal being the reduction of uncertainty concerning present beliefs about the future. An effective process will gather and leverage as much information as possible while maximizing the signal-to-noise ratio of that information — aims which are often at odds.
51 84 - In the digital world, we can apply several techniques to the same input and evaluate the potential impacts. E.g: Simulate different voting systems and see which one fits the best with the current views. This is a case for the system to have a final mechanism that acts as a layer for human to express preferences.
52 85 - [Every community and institutions wants to see a better, more responsive and dynamic provision of public goods within them, usually lack information about which goods have the greatest value and know quite a bit about social structure internally which would allow them to police the way GitCoin has in the domains it knows](https://gov.gitcoin.co/t/a-vision-for-a-pluralistic-civilizational-scale-infrastructure-for-funding-public-goods/9503/11).
···
55 88 - Can open data be rewarded with an IE? What does a block reward mean there?
56 89 - Seeing like a State blinds you to the realities that are complex. Need a way to evolve the metric to be anti-Goodhart's.
57 90 - Not even anti-goodharts. Research says the best thing to do is to give all money to vaccine distribution, ...
58 - - Run multiple "aggregations" algorithms and have humans blindly select which one they prefer (blind test).
91 + - **Embrace plurality over perfection**. [No single mechanism can satisfy all desirable properties](https://en.wikipedia.org/wiki/Arrow%27s_impossibility_theorem) (efficiency, fairness, incentive compatibility, budget balance). Different contexts need different trade-offs.
92 + - **Make evaluation infrastructure permissionless**. Just as anyone can fork code, anyone should be able to fork evaluation criteria. This prevents capture and enables innovation.
93 + - **Focus on error analysis**. Like in [LLM evaluations](https://hamel.dev/blog/posts/evals-faq/), understanding failure modes matters more than optimizing metrics. Study what breaks and why.
94 + - **Layer human judgment on algorithmic engines**. The ["engine and steering wheel" pattern](https://vitalik.eth.limo/general/2025/02/28/aihumans.html) - let algorithms handle scale while humans set direction and audit results.
95 + - The easier to verify the solution is (e.g: verify a program passes the test vs verify the experiment replicates), the better and faster the IE can be.
96 + - If the domain of the IE is sortable and differentiable, it's easy as it can be seen as pure optimization and doesn't require humans subjective evaluation.
97 + - Verify the evaluation is actually better than the baseline.
98 + - Run multiple "aggregations" algorithms and have humans blindly select which one they prefer (blind test).
59 99 - The meta-layer can help compose and evaluate mechanisms. How do we know mechanism B is better than A? Or even better than A + B, how do we evolve things?
60 100 - Reinforcement Learning?
101 + - Genetic algorithms?
102 + - Is the evaluation/reward better than a centralized/simpler alternative?
103 + - E.g: on tabular clinical prediction datasets, standard logistic regression was found to be on par with deep recurrent models
61 104 - [IEs need to show how the solution is produced by the interactions of people each of whom possesses only partial knowledge](https://news.ycombinator.com/item?id=44232461).
62 - - Bandit Algorithms?
105 + - IEs are optimization processes with tend to exploit (more impact, more reward). This ends up with a monopoly (100% exploit). You probably want to always have some exploration.
63 106 - Do IEs need some explore/exploit thing? E.g. Use multi-armed bandit algorithms to adaptively choose between evaluation mechanisms based on historical performance and context.
64 107 - Use maximal lotteries to enforce the exploration
65 108 - Having discrete rounds simplify the process. Like a batch pipeline.
66 109 - The more humans gets involved, the messier (papers, ... academia). You cannot get away from humans in most problems.
67 -
68 - Impact Evaluators are frameworks for [[Coordination|coordinating]] work and aligning [[Incentives]] in complex [[Systems]]. They provide mechanisms for retrospectively evaluating and rewarding contributions based on actual impact, helping solve coordination problems in [[Public Goods Funding]], research evaluation, and decentralized systems.
110 + - [Campbell's Law](https://en.wikipedia.org/wiki/Campbell%27s_law). The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.
111 + - [The McNamara Fallacy](https://en.wikipedia.org/wiki/McNamara_fallacy). Never choose metrics on the basis of what is easily measurable over what is meaningful. Data is inherently objectifying and naturally reduces complex conceptions and process into coarse representations. There’s a certain fetish for data that can be quantified.
112 + - IEs should define also a Data Structure for each layer so they can compose (graph, weight vector). That is the API.
113 + - E.g: Deepfunding problem data structure is a graph. Weights are a vector/dict, ...
114 + - IEs will have to do some sort of "error analysis". [Is the most important activity in LLM evals](https://hamel.dev/blog/posts/evals-faq/#q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed). Error analysis helps you decide what evals to write in the first place. It allows you to identify failure modes unique to your application and data.
115 + - Film festivals are "local" IEs each one serving different values/communities.
69 116
70 117 ## Principles
71 118
···
100 147 - Process Control Theory
101 148 - LLM Evals
102 149
103 - ## Design Considerations
104 -
105 - - **Start local and iterate**. Begin with small communities defining their own [[Metrics]] and evaluation criteria. Use rapid [[Feedback Loops]] to learn what works before scaling up.
106 - - Each community understands its context better than outsiders ([seeing like a state blinds you to local realities](https://slatestarcodex.com/2017/03/16/book-review-seeing-like-a-state/))
107 - - Local experiments surface patterns for higher-level systems
108 - - Small groups enable iterated games that reward trust and penalize defection
109 - - Reduced size reduce friction
110 - - **Build anti-Goodhart resilience**. Any metric used for decisions [becomes subject to gaming pressures](https://en.wikipedia.org/wiki/Campbell%27s_law). Design for evolution:
111 - - Run multiple evaluation algorithms in parallel and let humans choose
112 - - Use exploration/exploitation trade-offs (like multi-armed bandits) to test new metrics
113 - - Make the meta-layer for evaluating evaluators explicit
114 - - **Separate data from judgment**. [Impact Evaluators work like data-driven organizations](https://handbook.davidgasquez.com/data/data-culture):
115 - - Gather objective attestations about work (commits, usage stats, dependencies)
116 - - Apply multiple "evaluation lenses" to interpret the data
117 - - Let funders choose which lenses align with their values
118 - - **Design for composability**. Define clear data structures (graphs, weight vectors) as APIs between layers. This enables:
119 - - Multiple communities to share measurement infrastructure
120 - - Different evaluation methods to operate on the same data
121 - - Evolution through recombination rather than redesign
122 -
123 - ## Implementation Challenges
124 -
125 - - **Incomplete contracts problem**. [It's expensive to measure what really matters](https://meaningalignment.substack.com/p/market-intermediaries-a-post-agi), so we optimize proxies that drift from true goals.
126 - - Current markets optimize clicks and engagement over human flourishing
127 - - The more powerful the optimization, the more dangerous the misalignment
128 - - **Information elicitation without verification**. Getting truthful data from subjective evaluation when you can't verify it requires clever [[Mechanism Design]]:
129 - - [Peer prediction mechanisms](https://jonathanwarden.com/information-elicitation-mechanisms/) that reward agreement with hidden samples
130 - - [Bayesian Truth Serum](https://www.science.org/doi/10.1126/science.1102081) that uses both answers and predictions
131 - - Coordination games where truth serves as a Schelling point
132 - - **Collusion resistance**. Any mechanism helping under-coordinated parties will also help [over-coordinated parties extract value](https://vitalik.eth.limo/general/2019/04/03/collusion.html). Countermeasures include:
133 - - Identity-free incentives (like proof-of-work)
134 - - Fork-and-exit rights for minorities
135 - - Privacy pools that exclude provably malicious actors
136 - - Multiple independent "dashboard organizations" preventing capture
137 - - They should be flexible as it's hard to predict ways the evaluation metrics will be gamed.
138 -
139 150 ## Mechanism Toolkit
140 151
141 152 - **Staking and slashing**. Require deposits that get burned for misbehavior. Simple but requires upfront capital.
···
154 165 - **Token-curated registries (TCRs)**. Stakeholders deposit tokens to curate lists; challengers and voters decide on inclusions, with slashing/redistribution to discourage bad entries.
155 166 - **Deliberative protocols**. [Structured discussion processes](https://jonathanwarden.com/deliberative-consensus-protocols/) that surface information before voting.
156 167
157 - ## The Path Forward
158 -
159 - - **Embrace plurality over perfection**. [No single mechanism can satisfy all desirable properties](https://en.wikipedia.org/wiki/Arrow%27s_impossibility_theorem) (efficiency, fairness, incentive compatibility, budget balance). Different contexts need different trade-offs.
160 - - **Make evaluation infrastructure permissionless**. Just as anyone can fork code, anyone should be able to fork evaluation criteria. This prevents capture and enables innovation.
161 - - **Focus on error analysis**. Like in [LLM evaluations](https://hamel.dev/blog/posts/evals-faq/), understanding failure modes matters more than optimizing metrics. Study what breaks and why.
162 - - **Layer human judgment on algorithmic engines**. The ["engine and steering wheel" pattern](https://vitalik.eth.limo/general/2025/02/28/aihumans.html) - let algorithms handle scale while humans set direction and audit results.
163 -
164 - Impact Evaluators are powerful but dangerous. Like nuclear reactors, they can solve major [[Coordination]] problems when designed well, but cascade failures are catastrophic. Start small, fail safely, and always maintain [credible exit options](https://newsletter.squishy.computer/p/soulbinding-like-a-state).
165 -
166 168 ## Ideas
167 169
168 - ### Plurality Impact Evaluators
170 + ### Plurality Lens Impact Evaluators
169 171
170 172 A federated network or ecosystem of IEs built on a shared, transparent substrate (blockchain). Different communities ("Impact Pods") define their own scopes and objectives, leverage diverse measurement tools, and are evaluated through multiple, competing "Evaluation Lenses." Funding flows through dedicated pools linked to these Pods and Lenses.
171 173
···
219 221 - [Tournament Theory: Thirty Years of Contests and Competitions](https://www.researchgate.net/publication/275441821_Tournament_Theory_Thirty_Years_of_Contests_and_Competitions)
220 222 - [Are We Learning Yet? A Meta-Review of Evaluation Failures Across Machine Learning](https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/757b505cfd34c64c85ca5b5690ee5293-Paper-round2.pdf)
221 223 - [Asymmetry of verification and verifier's law](https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law)
224 + - [Ostrom's Common Pool Resource Management](https://earthbound.report/2018/01/15/elinor-ostroms-8-rules-for-managing-the-commons/)
225 + - [Community Notes Note ranking algorithm](https://communitynotes.x.com/guide/en/under-the-hood/ranking-notes)
226 + - [Deep Funding is a Special Case of Generalized Impact Evaluators](https://hackmd.io/@dwddao/HypnqpQKke)
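
The note in this diff suggests using multi-armed bandit algorithms to choose between evaluation mechanisms while always keeping some exploration. A minimal sketch of that idea, assuming an epsilon-greedy strategy (one simple bandit among many, not something the note prescribes) and a numeric score per round (for example, the share of humans preferring a mechanism's allocation in a blind test). The class name, mechanism names, and simulated qualities below are all hypothetical.

```python
import random


class MechanismBandit:
    """Epsilon-greedy choice among evaluation mechanisms (illustrative only)."""

    def __init__(self, mechanisms, epsilon=0.1, rng=None):
        self.mechanisms = list(mechanisms)
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.counts = {m: 0 for m in self.mechanisms}
        self.means = {m: 0.0 for m in self.mechanisms}

    def select(self):
        # Try every mechanism once before trusting any running average.
        for m in self.mechanisms:
            if self.counts[m] == 0:
                return m
        # Explore with probability epsilon so no mechanism becomes a monopoly.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.mechanisms)
        # Otherwise exploit the best mechanism observed so far.
        return max(self.mechanisms, key=lambda m: self.means[m])

    def update(self, mechanism, score):
        # Incremental mean: new_mean = old_mean + (score - old_mean) / n.
        self.counts[mechanism] += 1
        n = self.counts[mechanism]
        self.means[mechanism] += (score - self.means[mechanism]) / n


def simulate(rounds=500, seed=0):
    """Run discrete rounds against made-up mechanism qualities."""
    env = random.Random(seed)
    # Hypothetical probability that a round using each mechanism is judged good.
    quality = {"pairwise_ranking": 0.7, "quadratic_voting": 0.5, "metric_weights": 0.4}
    bandit = MechanismBandit(quality, epsilon=0.1, rng=random.Random(seed + 1))
    for _ in range(rounds):
        chosen = bandit.select()
        score = 1.0 if env.random() < quality[chosen] else 0.0
        bandit.update(chosen, score)
    return bandit
```

Calling `simulate()` returns the bandit, whose `counts` show how often each mechanism was used and whose `means` hold the observed average score. Setting `epsilon=0` gives the pure-exploit behaviour the note warns about.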