···11# Impact Evaluators
2233+Impact Evaluators are frameworks for [[Coordination|coordinating]] work and aligning [[Incentives]] in complex [[Systems]]. They provide mechanisms for retrospectively evaluating and rewarding contributions based on actual impact, helping solve coordination problems in [[Public Goods Funding]], research evaluation, and decentralized systems.
44+35It's hard to fund [[Public Goods Funding|public goods]], open-source software, research, and similar work that has no clear, immediate financial return, especially high-risk/high-reward projects.
4657Traditional funding often fails here. Instead of just giving money upfront (prospectively), Impact Evaluators create systems that look back at what work was actually done and what impact it actually had (retrospectively). It's much easier to judge the impact in a retrospective way!
···2022 - Focus on positive sum games and mechanisms.
2123 - E.g: OSO's "developer count" requires +5 commits to be counted. You might or might not align with that metric.
2224- IEs, like most systems, should have a deadline or similar expiry condition so they fade away if they're not working.
2525+- Fix rules to keep things simple and easy to play. Opinionated framework with sane defaults!
2326- [IEs are the scientific method in disguise like AI evals](https://eugeneyan.com/writing/eval-process/). You need automated IEs, which is basically science applied to building better systems. You also need human oversight.
2424-- For optimization tasks with continuous output, follow bittensor model.
2727+- For areas with continuous output (e.g: minting for "better path finding algorithms"), follow the Bittensor model.
2528- IEs are like nuclear power: extremely powerful if used correctly, but so very easy to get wrong, and when things go wrong the whole thing blows up in your face.
2929+- **Start local and iterate**. Begin with small communities defining their own [[Metrics]] and evaluation criteria. Use rapid [[Feedback Loops]] to learn what works before scaling up.
3030+ - Each community understands its context better than outsiders ([seeing like a state blinds you to local realities](https://slatestarcodex.com/2017/03/16/book-review-seeing-like-a-state/))
3131+ - Local experiments surface patterns for higher-level systems
3232+ - Small groups enable iterated games that reward trust and penalize defection
3333+ - Reduced size reduces friction
3434+- **Build anti-Goodhart resilience**. Any metric used for decisions [becomes subject to gaming pressures](https://en.wikipedia.org/wiki/Campbell%27s_law). Design for evolution:
3535+ - Run multiple evaluation algorithms in parallel and let humans choose
3636+ - Use exploration/exploitation trade-offs (like multi-armed bandits) to test new metrics
3737+ - Make the meta-layer for evaluating evaluators explicit
3838+- **Separate data from judgment**. [Impact Evaluators work like data-driven organizations](https://handbook.davidgasquez.com/data/data-culture):
3939+ - Gather objective attestations about work (commits, usage stats, dependencies)
4040+ - Apply multiple "evaluation lenses" to interpret the data
4141+ - Let funders choose which lenses align with their values
4242+- **Design for composability**. Define clear data structures (graphs, weight vectors) as APIs between layers. This enables:
4343+ - Multiple communities to share measurement infrastructure
4444+ - Different evaluation methods to operate on the same data
4545+ - Evolution through recombination rather than redesign
2646- We might be in an "Arrow's Impossibility Theorem" situation where there is no way to design a mechanism that is fair, efficient and incentive compatible.
2747 - There is no "end of history" because whenever you fix an evaluation, some group has an incentive to abuse or break it again and feast on the wreckage.
2848 - This is the formal impossibility theorem that no mechanism can simultaneously achieve four desirable criteria:
···4767 - Voting on models: feels too abstract for voters and doesn't leverage their specific project expertise
4868 - Voting on metrics: judges just play with numbers until they get their favored allocation
4969 - Voting directly on projects: halo effect, peanut butter distributions, heavy operational workload
7070+- **Incomplete contracts problem**. [It's expensive to measure what really matters](https://meaningalignment.substack.com/p/market-intermediaries-a-post-agi), so we optimize proxies that drift from true goals.
7171+ - Current markets optimize clicks and engagement over human flourishing
7272+ - The more powerful the optimization, the more dangerous the misalignment
7373+- **Information elicitation without verification**. Getting truthful data from subjective evaluation when you can't verify it requires clever [[Mechanism Design]]:
7474+ - [Peer prediction mechanisms](https://jonathanwarden.com/information-elicitation-mechanisms/) that reward agreement with hidden samples
7575+ - [Bayesian Truth Serum](https://www.science.org/doi/10.1126/science.1102081) that uses both answers and predictions
7676+ - Coordination games where truth serves as a Schelling point
7777+- **Collusion resistance**. Any mechanism helping under-coordinated parties will also help [over-coordinated parties extract value](https://vitalik.eth.limo/general/2019/04/03/collusion.html). Countermeasures include:
7878+ - Identity-free incentives (like proof-of-work)
7979+ - Fork-and-exit rights for minorities
8080+ - Privacy pools that exclude provably malicious actors
8181+ - Multiple independent "dashboard organizations" preventing capture
8282+ - They should be flexible as it's hard to predict ways the evaluation metrics will be gamed.
5083- An allocation mechanism can be seen as a measurement process, with the goal being the reduction of uncertainty concerning present beliefs about the future. An effective process will gather and leverage as much information as possible while maximizing the signal-to-noise ratio of that information, aims which are often at odds.
5184- In the digital world, we can apply several techniques to the same input and evaluate the potential impacts. E.g: Simulate different voting systems and see which one best fits the current views. This is an argument for giving the system a final mechanism that acts as a layer for humans to express preferences.
5285- [Every community and institutions wants to see a better, more responsive and dynamic provision of public goods within them, usually lack information about which goods have the greatest value and know quite a bit about social structure internally which would allow them to police the way GitCoin has in the domains it knows](https://gov.gitcoin.co/t/a-vision-for-a-pluralistic-civilizational-scale-infrastructure-for-funding-public-goods/9503/11).
···5588 - Can open data be rewarded with an IE? What does a block reward mean there?
5689- Seeing like a State blinds you to complex realities. We need a way to evolve the metric so it resists Goodhart's law.
5790 - Not even anti-goodharts. Research says the best thing to do is to give all money to vaccine distribution, ...
5858-- Run multiple "aggregations" algorithms and have humans blindly select which one they prefer (blind test).
9191+- **Embrace plurality over perfection**. [No single mechanism can satisfy all desirable properties](https://en.wikipedia.org/wiki/Arrow%27s_impossibility_theorem) (efficiency, fairness, incentive compatibility, budget balance). Different contexts need different trade-offs.
9292+- **Make evaluation infrastructure permissionless**. Just as anyone can fork code, anyone should be able to fork evaluation criteria. This prevents capture and enables innovation.
9393+- **Focus on error analysis**. Like in [LLM evaluations](https://hamel.dev/blog/posts/evals-faq/), understanding failure modes matters more than optimizing metrics. Study what breaks and why.
9494+- **Layer human judgment on algorithmic engines**. The ["engine and steering wheel" pattern](https://vitalik.eth.limo/general/2025/02/28/aihumans.html) - let algorithms handle scale while humans set direction and audit results.
9595+- The easier the solution is to verify (e.g: verifying that a program passes its tests vs. verifying that an experiment replicates), the better and faster the IE can be.
9696+- If the domain of the IE is sortable and differentiable, it's easy, as it can be seen as pure optimization and doesn't require subjective human evaluation.
9797+- Verify the evaluation is actually better than the baseline.
9898+ - Run multiple "aggregations" algorithms and have humans blindly select which one they prefer (blind test).
5999 - The meta-layer can help compose and evaluate mechanisms. How do we know mechanism B is better than A? Or even better than A + B, how do we evolve things?
60100 - Reinforcement Learning?
101101+ - Genetic algorithms?
102102+ - Is the evaluation/reward better than a centralized/simpler alternative?
103103+ - E.g: on tabular clinical prediction datasets, standard logistic regression was found to be on par with deep recurrent models
61104- [IEs need to show how the solution is produced by the interactions of people each of whom possesses only partial knowledge](https://news.ycombinator.com/item?id=44232461).
6262-- Bandit Algorithms?
105105+- IEs are optimization processes which tend to exploit (more impact, more reward). This ends up with a monopoly (100% exploit). You probably want to always have some exploration.
63106 - Do IEs need some explore/exploit thing? E.g. Use multi-armed bandit algorithms to adaptively choose between evaluation mechanisms based on historical performance and context.
64107 - Use maximal lotteries to enforce the exploration
65108- Having discrete rounds simplifies the process. Like a batch pipeline.
66109- The more humans get involved, the messier (papers, ... academia). You cannot get away from humans in most problems.
6767-6868-Impact Evaluators are frameworks for [[Coordination|coordinating]] work and aligning [[Incentives]] in complex [[Systems]]. They provide mechanisms for retrospectively evaluating and rewarding contributions based on actual impact, helping solve coordination problems in [[Public Goods Funding]], research evaluation, and decentralized systems.
110110+- [Campbell's Law](https://en.wikipedia.org/wiki/Campbell%27s_law). The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.
111111+- [The McNamara Fallacy](https://en.wikipedia.org/wiki/McNamara_fallacy). Never choose metrics on the basis of what is easily measurable over what is meaningful. Data is inherently objectifying and naturally reduces complex conceptions and processes into coarse representations. There's a certain fetish for data that can be quantified.
112112+- IEs should also define a data structure for each layer so they can compose (graph, weight vector). That is the API.
113113+ - E.g: Deepfunding problem data structure is a graph. Weights are a vector/dict, ...
114114+- IEs will have to do some sort of "error analysis". [It is the most important activity in LLM evals](https://hamel.dev/blog/posts/evals-faq/#q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed). Error analysis helps you decide what evals to write in the first place. It allows you to identify failure modes unique to your application and data.
115115+- Film festivals are "local" IEs, each one serving different values/communities.
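
The explore/exploit note in the list above (use multi-armed bandits to choose between evaluation mechanisms based on historical performance) could look roughly like the sketch below. It is a minimal epsilon-greedy selector, not a recommended design: the class name, the `epsilon` default, and the idea that each round yields a single numeric "score" for the mechanism that was used are all assumptions made for illustration.

```python
import random


class EpsilonGreedySelector:
    """Hypothetical helper: picks an evaluation mechanism per round,
    always reserving a share of rounds for exploration."""

    def __init__(self, mechanisms, epsilon=0.1, seed=None):
        self.mechanisms = list(mechanisms)
        self.epsilon = epsilon  # assumed exploration rate
        self.rng = random.Random(seed)
        self.counts = {m: 0 for m in self.mechanisms}
        self.mean_score = {m: 0.0 for m in self.mechanisms}

    def choose(self):
        # Explore: with probability epsilon, pick any mechanism uniformly.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.mechanisms)
        # Exploit: otherwise pick the best average score seen so far.
        return max(self.mechanisms, key=lambda m: self.mean_score[m])

    def update(self, mechanism, score):
        # Incremental mean of the scores observed for this mechanism.
        self.counts[mechanism] += 1
        n = self.counts[mechanism]
        self.mean_score[mechanism] += (score - self.mean_score[mechanism]) / n


# Usage: mechanism names and scores are made up for the example.
selector = EpsilonGreedySelector(["quadratic", "pairwise", "metrics"], epsilon=0.2, seed=0)
selector.update("quadratic", 0.6)
selector.update("pairwise", 0.8)
next_mechanism = selector.choose()
```

Keeping `epsilon` above zero is what prevents the "100% exploit" monopoly: even a mechanism with a poor track record keeps getting an occasional round, so its estimate can recover if the context changes.
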
6911670117## Principles
71118···100147- Process Control Theory
101148- LLM Evals
102149103103-## Design Considerations
104104-105105-- **Start local and iterate**. Begin with small communities defining their own [[Metrics]] and evaluation criteria. Use rapid [[Feedback Loops]] to learn what works before scaling up.
106106- - Each community understands its context better than outsiders ([seeing like a state blinds you to local realities](https://slatestarcodex.com/2017/03/16/book-review-seeing-like-a-state/))
107107- - Local experiments surface patterns for higher-level systems
108108- - Small groups enable iterated games that reward trust and penalize defection
109109- - Reduced size reduce friction
110110-- **Build anti-Goodhart resilience**. Any metric used for decisions [becomes subject to gaming pressures](https://en.wikipedia.org/wiki/Campbell%27s_law). Design for evolution:
111111- - Run multiple evaluation algorithms in parallel and let humans choose
112112- - Use exploration/exploitation trade-offs (like multi-armed bandits) to test new metrics
113113- - Make the meta-layer for evaluating evaluators explicit
114114-- **Separate data from judgment**. [Impact Evaluators work like data-driven organizations](https://handbook.davidgasquez.com/data/data-culture):
115115- - Gather objective attestations about work (commits, usage stats, dependencies)
116116- - Apply multiple "evaluation lenses" to interpret the data
117117- - Let funders choose which lenses align with their values
118118-- **Design for composability**. Define clear data structures (graphs, weight vectors) as APIs between layers. This enables:
119119- - Multiple communities to share measurement infrastructure
120120- - Different evaluation methods to operate on the same data
121121- - Evolution through recombination rather than redesign
122122-123123-## Implementation Challenges
124124-125125-- **Incomplete contracts problem**. [It's expensive to measure what really matters](https://meaningalignment.substack.com/p/market-intermediaries-a-post-agi), so we optimize proxies that drift from true goals.
126126- - Current markets optimize clicks and engagement over human flourishing
127127- - The more powerful the optimization, the more dangerous the misalignment
128128-- **Information elicitation without verification**. Getting truthful data from subjective evaluation when you can't verify it requires clever [[Mechanism Design]]:
129129- - [Peer prediction mechanisms](https://jonathanwarden.com/information-elicitation-mechanisms/) that reward agreement with hidden samples
130130- - [Bayesian Truth Serum](https://www.science.org/doi/10.1126/science.1102081) that uses both answers and predictions
131131- - Coordination games where truth serves as a Schelling point
132132-- **Collusion resistance**. Any mechanism helping under-coordinated parties will also help [over-coordinated parties extract value](https://vitalik.eth.limo/general/2019/04/03/collusion.html). Countermeasures include:
133133- - Identity-free incentives (like proof-of-work)
134134- - Fork-and-exit rights for minorities
135135- - Privacy pools that exclude provably malicious actors
136136- - Multiple independent "dashboard organizations" preventing capture
137137- - They should be flexible as it's hard to predict ways the evaluation metrics will be gamed.
138138-139150## Mechanism Toolkit
140151141152- **Staking and slashing**. Require deposits that get burned for misbehavior. Simple but requires upfront capital.
···154165- **Token-curated registries (TCRs)**. Stakeholders deposit tokens to curate lists; challengers and voters decide on inclusions, with slashing/redistribution to discourage bad entries.
155166- **Deliberative protocols**. [Structured discussion processes](https://jonathanwarden.com/deliberative-consensus-protocols/) that surface information before voting.
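
As a rough illustration of the staking and slashing entry in this toolkit, the sketch below keeps deposits in memory and burns a fraction of them on misbehavior. The `StakeRegistry` name, the `min_stake` threshold, and the fractional slashing rule are assumptions for the example; a real deployment would live on whatever ledger the IE uses and would need its own rules for deciding what counts as misbehavior.

```python
from dataclasses import dataclass, field


@dataclass
class StakeRegistry:
    """Hypothetical in-memory registry of deposits."""

    min_stake: float  # assumed minimum deposit required to take part
    stakes: dict = field(default_factory=dict)
    burned: float = 0.0  # total amount destroyed through slashing

    def deposit(self, participant, amount):
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.stakes[participant] = self.stakes.get(participant, 0.0) + amount

    def can_participate(self, participant):
        # Participation requires upfront capital, as noted in the list.
        return self.stakes.get(participant, 0.0) >= self.min_stake

    def slash(self, participant, fraction):
        # Burn a fraction of the participant's stake for misbehavior.
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fraction must be between 0 and 1")
        current = self.stakes.get(participant, 0.0)
        penalty = current * fraction
        self.stakes[participant] = current - penalty
        self.burned += penalty
        return penalty


# Usage: names and amounts are made up for the example.
registry = StakeRegistry(min_stake=10.0)
registry.deposit("alice", 20.0)
penalty = registry.slash("alice", 0.75)
```

After the slash, `alice` holds less than `min_stake` and can no longer participate until she deposits again, which is the deterrent the mechanism relies on.
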
156167157157-## The Path Forward
158158-159159-- **Embrace plurality over perfection**. [No single mechanism can satisfy all desirable properties](https://en.wikipedia.org/wiki/Arrow%27s_impossibility_theorem) (efficiency, fairness, incentive compatibility, budget balance). Different contexts need different trade-offs.
160160-- **Make evaluation infrastructure permissionless**. Just as anyone can fork code, anyone should be able to fork evaluation criteria. This prevents capture and enables innovation.
161161-- **Focus on error analysis**. Like in [LLM evaluations](https://hamel.dev/blog/posts/evals-faq/), understanding failure modes matters more than optimizing metrics. Study what breaks and why.
162162-- **Layer human judgment on algorithmic engines**. The ["engine and steering wheel" pattern](https://vitalik.eth.limo/general/2025/02/28/aihumans.html) - let algorithms handle scale while humans set direction and audit results.
163163-164164-Impact Evaluators are powerful but dangerous. Like nuclear reactors, they can solve major [[Coordination]] problems when designed well, but cascade failures are catastrophic. Start small, fail safely, and always maintain [credible exit options](https://newsletter.squishy.computer/p/soulbinding-like-a-state).
165165-166168## Ideas
167169168168-### Plurality Impact Evaluators
170170+### Plurality Lens Impact Evaluators
169171170172A federated network or ecosystem of IEs built on a shared, transparent substrate (blockchain). Different communities ("Impact Pods") define their own scopes and objectives, leverage diverse measurement tools, and are evaluated through multiple, competing "Evaluation Lenses." Funding flows through dedicated pools linked to these Pods and Lenses.
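
One way to picture the Pods and Lenses described above is as plain data plus interchangeable functions: a Pod publishes attestations, each Evaluation Lens turns them into a weight vector, and a funding pool is split according to whichever lens the funder picks. The sketch below assumes this shape; the project names, metric keys, and function names are hypothetical and not part of any existing system.

```python
def normalize(scores):
    """Turn raw scores into a weight vector that sums to 1."""
    total = sum(scores.values())
    if total == 0:
        # No signal at all: fall back to an even split.
        return {name: 1 / len(scores) for name in scores}
    return {name: value / total for name, value in scores.items()}


# Attestations published by a hypothetical Impact Pod (made-up numbers).
attestations = {
    "project-a": {"commits": 30, "dependents": 1},
    "project-b": {"commits": 10, "dependents": 3},
}


def commits_lens(data):
    # Lens 1: values raw development activity.
    return normalize({name: m["commits"] for name, m in data.items()})


def dependents_lens(data):
    # Lens 2: values how many other projects rely on the work.
    return normalize({name: m["dependents"] for name, m in data.items()})


def allocate(pool, weights):
    # Split a funding pool proportionally to a lens's weight vector.
    return {name: pool * w for name, w in weights.items()}


# Two funders looking at the same data through different lenses.
by_commits = allocate(1000, commits_lens(attestations))
by_dependents = allocate(1000, dependents_lens(attestations))
```

Because every lens returns the same kind of weight vector, lenses can be swapped, compared, or blended without touching the data collection layer.
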
171173···219221- [Tournament Theory: Thirty Years of Contests and Competitions](https://www.researchgate.net/publication/275441821_Tournament_Theory_Thirty_Years_of_Contests_and_Competitions)
220222- [Are We Learning Yet? A Meta-Review of Evaluation Failures Across Machine Learning](https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/757b505cfd34c64c85ca5b5690ee5293-Paper-round2.pdf)
221223- [Asymmetry of verification and verifier's law](https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law)
224224+- [Ostrom's Common Pool Resource Management](https://earthbound.report/2018/01/15/elinor-ostroms-8-rules-for-managing-the-commons/)
225225+- [Community Notes Note ranking algorithm](https://communitynotes.x.com/guide/en/under-the-hood/ranking-notes)
226226+- [Deep Funding is a Special Case of Generalized Impact Evaluators](https://hackmd.io/@dwddao/HypnqpQKke)