📝 Revise Impact Evaluators documentation to enhance clarity on community feedback mechanisms and simplify evaluation principles

+52 -48

1 changed file

Impact Evaluators.md

···
  - The goal is to **create strong incentives for people/teams to work on valuable, uncertain things** by promising a reward if they succeed in creating demonstrable impact.
  - Impact Evaluators work well on concrete things that you can turn into measurable stuff.
  - They are powerful things and will overfit. When the goal is not well aligned, they can be harmful. E.g: Bitcoin wasn't created to maximize the energy consumption. An Impact Evaluators might become an Externalities Maximizers.
- - **Community Feedback Mechanism**. Implement robust feedback systems that allow participants to report and address concerns about the integrity of the metrics or behaviors in the community. Use the feedback to refine and improve the system.
- - Designing IEs has the side effect of making impact more legible, decomposed into specific properties, which can be represented by specific metrics.
- - Something like l2beat as a leaderboard
- - IEs should [make "making the next L2beat" a permissionless process](https://vitalik.eth.limo/general/2024/09/28/alignment.html) for the space. Independent entities should arise to evaluate how projects met the IE criteria.
- - Do more to make different aspects of alignment legible, while not centralizing in one single "watcher", we can make the concept much more effective, and fair and inclusive in the way that the Ethereum ecosystem strives to be.
- - Impact Evaluators need to be (permissionless) forkable.
- - Anyone should be able to [fork the evaluation system with their own criteria](https://vitalik.eth.limo/general/2024/09/28/alignment.html), preventing capture and enabling experimentation.
  - **Start local and iterate**. Begin with small communities defining their own [[Metrics]] and evaluation criteria. Use rapid [[Feedback Loops]] to learn what works before scaling up.
  - Each community understands its context better than outsiders ([seeing like a state blinds you to local realities](https://slatestarcodex.com/2017/03/16/book-review-seeing-like-a-state/))
  - Local experiments surface patterns for higher-level systems
···
  - Figure out system structures and incentives and use as an examples for the level above.
  - Focus on positive sum games and mechanisms.
  - E.g: OSO's "developer count" requires +5 commits to be counted. You might or might not align with that metric.
- - IEs, as most systems should have a deadline or something like that so it fades away if it's not working.
- - Fix rules to keep things simple and easy to play. Opinionated framework with sane defaults!
- - [IEs are the scientific method in disguise like AI evals](https://eugeneyan.com/writing/eval-process/). You need automated IEs, which is basically science applied to building better systems. You also need human oversight.
- - For areas with continuous output (e.g: minting for "better path finding algorightms"), follow Bittensor model.
+ - **Community Feedback Mechanism**. Implement robust feedback systems that allow participants to report and address concerns about the integrity of the metrics or behaviors in the community. Use the feedback to refine and improve the system.
+ - [Every community and institutions wants to see a better, more responsive and dynamic provision of public goods within them, usually lack information about which goods have the greatest value and know quite a bit about social structure internally which would allow them to police the way GitCoin has in the domains it knows](https://gov.gitcoin.co/t/a-vision-for-a-pluralistic-civilizational-scale-infrastructure-for-funding-public-goods/9503/11).
+ - IE's helps a community with more data and information to make better decisions.
+ - Open Data Platforms for the community to gather better data and make better decisions.
+ - Can open data be rewarded with an IE? What does a block reward mean there?
+ - Prioritize consent and community feedback.
+ - Community should steer the ship.
+ - Design a democratic control that reacts to feedback.
+ - Allow people to express themselves as much as they want.
+ - Super expert with lots of context already have the weights!
  - IEs are like nuclear power: extremely powerful if used correctly, but so very easy to get wrong, and when things go wrong the whole thing blows up in your face.
+ - For areas with continuous output (e.g: minting for "better path finding algorightms"), follow Bittensor model.
+ - IEs, as most systems should have a deadline or something like that so it fades away if it's not working.
+ - **Simplicity as a principle**. Fix rules to keep things simple and easy to play. Opinionated framework with sane defaults!
+ - [The simpler a mechanism, the less space for hidden privilege](https://vitalik.eth.limo/general/2020/09/11/coordination.html). Fewer parameters mean more resistance to corruption and overfit and more people engaging.
+ - Demonstrably fair and impartial to all participants (open source and publicly verifiable execution), with no hidden biases or privileged interests
+ - Don't write specific people or outcomes into the mechanism (e.g: using multiple accounts)
  - **Build anti-Goodhart resilience**. Any metric used for decisions [becomes subject to gaming pressures](https://en.wikipedia.org/wiki/Campbell%27s_law). Design for evolution:
  - Run multiple evaluation algorithms in parallel and let humans choose
  - Use exploration/exploitation trade-offs (like multi-armed bandits) to test new metrics
  - Make the meta-layer for evaluating evaluators explicit
+ - **Collusion resistance**. Any mechanism helping under-coordinated parties will also help [over-coordinated parties extract value](https://vitalik.eth.limo/general/2019/04/03/collusion.html). Countermeasures include:
+ - Identity-free incentives (like proof-of-work).
+ - Fork-and-exit rights for minorities.
+ - Privacy pools that exclude provably malicious actors.
+ - Multiple independent "dashboard organizations" preventing capture.
+ - They should be flexible as it's hard to predict ways the evaluation metrics will be gamed.
+ - [Campbell's Law](https://en.wikipedia.org/wiki/Campbell%27s_law). The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.
+ - [The McNamara Fallacy](https://en.wikipedia.org/wiki/McNamara_fallacy). Never choose metrics on the basis of what is easily measurable over what is meaningful. Data is inherently objectifying and naturally reduces complex conceptions and process into coarse representations. There's a certain fetish for data that can be quantified.
  - **Separate data from judgment**. [Impact Evaluators work like data-driven organizations](https://handbook.davidgasquez.com/data/data-culture):
  - Gather objective attestations about work (commits, usage stats, dependencies)
  - Apply multiple "evaluation lenses" to interpret the data
···
  - Multiple communities to share measurement infrastructure
  - Different evaluation methods to operate on the same data
  - Evolution through recombination rather than redesign
+ - IEs should define also a Data Structure for each layer so they can compose (graph, weight vector). That is the API
+ - E.g: Deepfunding problem data structure is a graph. Weights are a vector/dict, ...
  - **Embrace plurality over perfection**. [No single mechanism can satisfy all desirable properties](https://en.wikipedia.org/wiki/Arrow%27s_impossibility_theorem) (efficiency, fairness, incentive compatibility, budget balance). Different contexts need different trade-offs.
  - There is no "end of history" because whenever you fix an evaluation, some group has an incentive to abuse or break it again and feast on the wreckage.
  - This is the formal impossibility theorem that no mechanism can simultaneously achieve four desirable criteria:
···
  - Individual Rationality: Ensuring that every participant has a non-negative utility (or at least no worse off) by participating in the mechanism.
  - Budget Balance: The mechanism generates sufficient revenue to cover its costs or payouts, without running a net deficit.
  - When collecting data, [pairwise comparisons and rankings are more reliable than absolute scoring](https://anishathalye.com/designing-a-better-judging-system/). Humans excel at relative judgments, but struggle with absolute judgments.
+ - Pairwise shines when all the context is in the UX.
  - **Legible Impact Attribution**. Make contributions and their value visible. [Transform vague notions of "alignment" into measurable criteria](https://vitalik.eth.limo/general/2024/09/28/alignment.html) that projects can compete on.
+ - Designing IEs has the side effect of making impact more legible, decomposed into specific properties, which can be represented by specific metrics
+ - Something like l2beat as a leaderboard
+ - IEs should [make "making the next L2beat" a permissionless process](https://vitalik.eth.limo/general/2024/09/28/alignment.html) for the space. Independent entities should arise to evaluate how projects met the IE criteria
+ - Do more to make different aspects of alignment legible, while not centralizing in one single "watcher", we can make the concept much more effective, and fair and inclusive in the way that the Ethereum ecosystem strives to be
  - Support organizations like L2beat to track project alignment
  - Let projects compete on measurable criteria rather than connections
  - Enable neutral evaluation by EF and others
  - Create separation of powers through multiple independent "dashboard organizations"
+ - Seeing like a State blinds you to the realities that are complex. Need a way to evolve the metric to be anti-Goodhart's
+ - Not even anti-goodharts. Research says the best thing to do is to give all money to vaccine distribution, ...
  - Tradeoffs in public goods funding approaches:
  - Voting on models: feels too abstract for voters and doesn't leverage their specific project expertise
  - Voting on metrics: judges just play with numbers until they get their favored allocation
···
  - [Peer prediction mechanisms](https://jonathanwarden.com/information-elicitation-mechanisms/) that reward agreement with hidden samples
  - [Bayesian Truth Serum](https://www.science.org/doi/10.1126/science.1102081) that uses both answers and predictions.
  - Coordination games where truth serves as a Schelling point.
- - **Collusion resistance**. Any mechanism helping under-coordinated parties will also help [over-coordinated parties extract value](https://vitalik.eth.limo/general/2019/04/03/collusion.html). Countermeasures include:
- - Identity-free incentives (like proof-of-work).
- - Fork-and-exit rights for minorities.
- - Privacy pools that exclude provably malicious actors.
- - Multiple independent "dashboard organizations" preventing capture.
- - They should be flexible as it's hard to predict ways the evaluation metrics will be gamed.
- - [The simpler a mechanism, the less space for hidden privilege](https://vitalik.eth.limo/general/2020/09/11/coordination.html). Fewer parameters mean more resistance to corruption and overfit and more people engaging.
- - Demonstrably fair and impartial to all participants (open source and publicly verifiable execution), with no hidden biases or privileged interests
- - Don't write specific people or outcomes into the mechanism (e.g: using multiple accounts)
  - [An allocation mechanism can be seen as a measurement process, with the goal being the reduction of uncertainty concerning present beliefs about the future. An effective process will gather and leverage as much information as possible while maximizing the signal-to-noise ratio of that information — aims which are often at odds](https://blog.zaratan.world/p/quadratic-v-pairwise).
  - In the digital world, we can apply several techniques to the same input and evaluate the potential impacts. E.g: Simulate different voting systems and see which one fits the best with the current views. This is a case for the system to have a final mechanism that acts as a layer for human to express preferences.
- - [Every community and institutions wants to see a better, more responsive and dynamic provision of public goods within them, usually lack information about which goods have the greatest value and know quite a bit about social structure internally which would allow them to police the way GitCoin has in the domains it knows](https://gov.gitcoin.co/t/a-vision-for-a-pluralistic-civilizational-scale-infrastructure-for-funding-public-goods/9503/11).
- - IE's helps a community with more data and information to make better decisions.
- - Open Data Platforms for the community to gather better data and make better decisions.
- - Can open data be rewarded with an IE? What does a block reward mean there?
- - **Legible Impact Attribution**. Make contributions and their value visible. [Transform vague notions of "alignment" into measurable criteria](https://vitalik.eth.limo/general/2024/09/28/alignment.html) that projects can compete on.
- - Seeing like a State blinds you to the realities that are complex. Need a way to evolve the metric to be anti-Goodhart's.
- - Not even anti-goodharts. Research says the best thing to do is to give all money to vaccine distribution, ...
- - **Embrace plurality over perfection**. [No single mechanism can satisfy all desirable properties](https://en.wikipedia.org/wiki/Arrow%27s_impossibility_theorem) (efficiency, fairness, incentive compatibility, budget balance). Different contexts need different trade-offs.
  - **Make evaluation infrastructure permissionless**. Just as anyone can fork code, anyone should be able to fork evaluation criteria. This prevents capture and enables innovation.
+ - Impact Evaluators need to be (permissionless) forkable
+ - Anyone should be able to [fork the evaluation system with their own criteria](https://vitalik.eth.limo/general/2024/09/28/alignment.html), preventing capture and enabling experimentation
+ - [IEs are the scientific method in disguise like AI evals](https://eugeneyan.com/writing/eval-process/). You need automated IEs, which is basically science applied to building better systems. You also need human oversight.
  - **Focus on error analysis**. Like in [LLM evaluations](https://hamel.dev/blog/posts/evals-faq/), understanding failure modes matters more than optimizing metrics. Study what breaks and why.
+ - IEs will have to do some sort of "error analysis". [Is the most important activity in LLM evals](https://hamel.dev/blog/posts/evals-faq/#q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed). Error analysis helps you decide what evals to write in the first place. It allows you to identify failure modes unique to your application and data.
  - **Layer human judgment on algorithmic engines**. The ["engine and steering wheel" pattern](https://vitalik.eth.limo/general/2025/02/28/aihumans.html) - let algorithms handle scale while humans set direction and audit results.
+ - Use humans for sensing qualitative properties, machines for bookkeeping and preserve legitimacy by letting people choose/vote on the prefered evaluation mechanism.
+ - Making it so people don't have to do somehting is cool. Makeing it so people can't do that thing is bad. E.g: time saving tools like AI is great but humans should be able to jump in if they want!
+ - If people don't want to have their "time saved" have the freedom to express themselves. E.g: offer pairwise comparisons by default but let people expand on feedback or send large project reviews.
+ - Information gathering is messy and noisy. It's hard to get a clear picture of what people think. Let people express themselves as much as they want.
+ - The more humans gets involved, the messier (papers, ... academia). You cannot get away from humans in most problems.
+ - In the digital world, we can apply several techniques to the same input and evaluate the potential impacts. E.g: Simulate different voting systems and see which one fits the best with the current views. This is a case for the system to have a final mechanism that acts as a layer for human to express preferences.
  - The easier to verify the solution is (e.g: verify a program passes the test vs verify the experiment replicates), the better and faster the IE can be.
  - If the domain of the IE is sortable and differentiable, it's easy as it can be seen as pure optimization and doesn't require humans subjective evaluation.
  - **Verify the evaluation is actually better than the baseline**.
···
  - Genetic algorithms?
  - Is the evaluation/reward better than a centralized/simpler alternative?
  - E.g: on tabular clinical prediction datasets, standard logistic regression was found to be on par with deep recurrent models.
- - [IEs need to show how the solution is produced by the interactions of people each of whom possesses only partial knowledge](https://news.ycombinator.com/item?id=44232461).
  - **Exploration vs Exploitation**. IEs are optimization processes with tend to exploit (more impact, more reward). This ends up with a monopoly (100% exploit). You probably want to always have some exploration.
  - Do IEs need some explore/exploit thing? E.g. Use multi-armed bandit algorithms to adaptively choose between evaluation mechanisms based on historical performance and context.
- - Having discrete rounds simplify the process. Like a batch pipeline.
- - The more humans gets involved, the messier (papers, ... academia). You cannot get away from humans in most problems.
- - [Campbell's Law](https://en.wikipedia.org/wiki/Campbell%27s_law). The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.
- - [The McNamara Fallacy](https://en.wikipedia.org/wiki/McNamara_fallacy). Never choose metrics on the basis of what is easily measurable over what is meaningful. Data is inherently objectifying and naturally reduces complex conceptions and process into coarse representations. There’s a certain fetish for data that can be quantified.
- - **Composable Data Structures**. IEs should define also a Data Structure for each layer so they can compose (graph, weight vector). That is the API.
- - E.g: Deepfunding problem data structure is a graph. Weights are a vector/dict, ...
- - IEs will have to do some sort of "error analysis". [Is the most important activity in LLM evals](https://hamel.dev/blog/posts/evals-faq/#q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed). Error analysis helps you decide what evals to write in the first place. It allows you to identify failure modes unique to your application and data.
- - Film festivals are "local" IEs each one serving different values/communities.
- - Use humans for sensing qualitative properties, machines for bookkeeping and preserve legitimacy by letting people choose/vote on the prefered evaluation mechanism.
- - You can reduce coordination overhead through adaptive lazy consensus (continuous pairwise voting).
  - The most important thing to do is to keep experimenting and learns from previous iterations
  - Cultivate a culture which welcomes experimentation.
  - Ostrom's Law. "A resource arrangement that works in practice can work in theory"
+ - [IEs need to show how the solution is produced by the interactions of people each of whom possesses only partial knowledge](https://news.ycombinator.com/item?id=44232461).
+ - Having discrete rounds simplify the process. Like a batch pipeline.
+ - Film festivals are "local" IEs each one serving different values/communities.
+ - You can reduce coordination overhead through adaptive lazy consensus (continuous pairwise voting).
  - To create a permissionless way for projects to participate, staking is a solution.
  - You want a reactive and self balancing system. Loops where one parts reacts the other parts.
  - Feedback loop with the errors of the previous round.
···
  - What would you change about the process?
  - Have a democratic way of expressing the values of the community and some representatives.
  - Economist might be good at analyzing economies but doens't mean they're good at creating them. A phisicist or ecologist might be a better fit.
- - Making it so people don't have to do somehting is cool. Makeing it so people can't do that thing is bad. E.g: time saving tools like AI is great but humans should be able to jump in if they want!
- - If people don't want to have their "time saved" have the freedom to express themselves. E.g: offer pairwise comparisons by default but let people expand on feedback or send large project reviews.
- - Information gathering is messy and noisy. It's hard to get a clear picture of what people think. Let people express themselves as much as they want.
  - Complex model of people aren't always good (performative reactions, noise, ...)
- - Prioritize consent and community feedback.
- - Community should steer the ship.
- - Design a democratic control that reacts to feedback.
- - Allow people to express themselves as much as they want.
- - Super expert with lots of context already have the weights!
- - Pairwise shines when all the context is in the UX.

  ## Principles

···
  - **Liquid Democracy** - Delegation of evaluation power to trusted experts, revocable at any time. Balances expertise with democratic control.
  - **Threshold Cryptography/Secret Sharing** - For private evaluation scores that only become public when aggregated. Prevents anchoring and collusion during evaluation.
  - **Augmented Bonding Curves with Vesting** - Time-locked rewards that vest based on continued positive evaluation over time, aligning long-term incentives
+ - **Multi-armed Bandits** - Adaptive mechanism selection algorithms that balance exploration and exploitation. Dynamically choose between evaluation mechanisms based on historical performance and context to optimize for both learning and effectiveness.
+ - **Privacy Pools** - Systems that maintain participant privacy while excluding provably malicious actors. Allow honest participants to prove non-membership in bad actor sets without revealing their identity.
+ - **Reinforcement Learning for Meta-Evaluation** - Use RL to evolve evaluation mechanisms through trial and error. The system learns which evaluation approaches work best in different contexts by treating mechanism selection as a sequential decision problem.
+ - **Genetic Algorithms** - Evolution-based optimization for evaluation mechanisms. Breed and mutate successful evaluation strategies, allowing the system to discover novel approaches through recombination and selection pressure.
+ - **Schelling Point Coordination Games** - Information elicitation mechanisms where truth naturally emerges as the coordination point. Participants are incentivized to report honestly because they expect others to do the same, making truth the natural focal point.

  ## Ideas

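
The **Multi-armed Bandits** entry in this diff describes choosing between evaluation mechanisms based on historical performance while keeping some exploration. A minimal sketch of that idea, using an epsilon-greedy strategy, might look like the following. The class name, the mechanism names (`"pairwise"`, `"quadratic"`), and the notion of a numeric per-round "reward" for a mechanism are all assumptions made for illustration; the notes do not specify how a mechanism's performance would be scored.

```python
import random


class EpsilonGreedySelector:
    """Hypothetical helper: pick an evaluation mechanism each round.

    With probability `epsilon` it explores (random mechanism); otherwise it
    exploits the mechanism with the best average reward observed so far.
    """

    def __init__(self, mechanisms, epsilon=0.1, seed=None):
        self.mechanisms = list(mechanisms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {m: 0 for m in self.mechanisms}
        self.mean_reward = {m: 0.0 for m in self.mechanisms}

    def select(self):
        # Explore: occasionally try a random mechanism so the system
        # never collapses into 100% exploitation.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.mechanisms)
        # Exploit: use the mechanism with the highest average reward.
        return max(self.mechanisms, key=lambda m: self.mean_reward[m])

    def update(self, mechanism, reward):
        # Incremental running mean of the rewards seen for this mechanism.
        self.counts[mechanism] += 1
        n = self.counts[mechanism]
        self.mean_reward[mechanism] += (reward - self.mean_reward[mechanism]) / n


if __name__ == "__main__":
    # Assumed setup: each round, the community scores how well the chosen
    # mechanism worked (here simulated with made-up success probabilities).
    assumed_quality = {"pairwise": 0.7, "quadratic": 0.5}
    selector = EpsilonGreedySelector(assumed_quality, epsilon=0.2, seed=42)
    sim = random.Random(0)
    for _ in range(200):
        chosen = selector.select()
        reward = 1.0 if sim.random() < assumed_quality[chosen] else 0.0
        selector.update(chosen, reward)
    print(selector.counts, selector.mean_reward)
```

Discrete rounds, as the notes suggest, keep this simple: one `select` and one `update` per round. Epsilon-greedy is only one possible strategy; the notes do not commit to a particular bandit algorithm.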
