📝 Enhance weight evaluation metrics · davidgasquez.com/handbook@537fabb

+8 -5

1 changed file

Deep Funding.md

@@ -58,7 +58,7 @@
  - In the pilot (huber loss), some projects got weights on a scale jurors didn't feel reasonable (e.g: EIPs repo got 30%)
  - The prediction market might cause good modelers to not participate as time of entry is more important than having a good model
  - **Weights Evaluation**
- - [How do we measure success?](https://davidgasquez.com/weight-allocation-mechanism-evals/) If the goal of pattern recognition was to classify objects in a scene, it made sense to score an algorithm by how often it succeeded in doing so. What is the equivalent for Deep Funding?
+ - [How do we measure success?](https://davidgasquez.com/weight-allocation-mechanism-evals/) If the goal of pattern recognition was to classify objects in a scene, it made sense to score an algorithm by how often it succeeded in doing so. What is the equivalent for Deep Funding? What is the [metric we are optimizing](https://mlhp.stanford.edu/src/chap4.html#sec-metric-elicitation)?
  - Once the weights are set, there isn't [a process to evaluate how "fit" those are](https://davidgasquez.com/weight-allocation-mechanism-evals/)
  - E.g: the current idea is to gather a connected graph of pairwise comparisons, why not use that to reward projects directly and skip the Prediction Market?
  - We need a falsifiable hypotheses to validate Deep Funding is "better"
@@ -70,17 +70,20 @@

  ### Alternative Approach

- Given the current open problems, this is interesting and alternative way of running a Deep Funding "round". The gist of the idea is to **use only a few significant data points to choose and reward the final models** instead of deriving weights for the entire set of childs/dependencies of a project.
+ Given the current open problems, this is interesting and alternative way ([inspired by RLHF](https://www.itai-shapira.com/pdfs/pairwise_calibrated_rewards_for_pluralistic_alignment.pdf)) of running a Deep Funding "round". The gist of the idea is to **use only a few significant data points to choose and reward the final models** instead of deriving weights for the entire set of childs/dependencies of a project. Resolve the market with only a few, well-tested pairs!

  Like in the current setup, a DAG of projects is needed. The organizers publish that and also an encoded list of projects that will be evaluated by Jurors. Participants can only see the DAG, the "evaluated projects" will be revealed at the end.

- Once participans have worked on their models and send/trade their predictions, the "evaluated project" list is revealed and only those projects are used to evaluate predictions. The question here is... how can we evaluate only a few projects without jurors giving a connected graph to the rest of the projects? You can use metrics like the [Brier score](https://en.wikipedia.org/wiki/Brier_score) or using Bradley Terry to evaluating a pre-given mechanism's scores ([in that case you're fitting just a single global scale or temperature parameter to minimize negative log-likelihood](https://apxml.com/courses/rlhf-reinforcement-learning-human-feedback/chapter-3-reward-modeling-human-preferences/reward-model-calibration))!
+ Once participans have worked on their models and send/trade their predictions, the "evaluated project" list is revealed and only those projects are used to evaluate predictions. Best strategy is to price truthfully on the unknown benchmark subset. The question here is... how can we evaluate only a few projects without jurors giving a connected graph to the rest of the projects? You can use metrics like the [Brier score](https://en.wikipedia.org/wiki/Brier_score) or using [calibrated Bradley Terry](https://www.itai-shapira.com/pdfs/pairwise_calibrated_rewards_for_pluralistic_alignment.pdf) to evaluating a pre-given mechanism's scores ([in that case you're fitting just a single global scale or temperature parameter to minimize negative log-likelihood](https://apxml.com/courses/rlhf-reinforcement-learning-human-feedback/chapter-3-reward-modeling-human-preferences/reward-model-calibration))!
+
+ Once the best model is chosen, the same pairwise comparisons can be used [to adjust the scale of the weight distribution](https://proceedings.mlr.press/v70/guo17a/guo17a.pdf). That means the market resolution uses only the subset (for payouts to traders) but the funding distribution uses the model's global ranking with its probabilities calibrated to the subset via a single scalar 𝑎 that pins the entire slate to the same scale that was verified by real judgments. The jurors pairwise comparisons can even be "merged" with the best model to incorporate all data in there.

- The task of the organizers is to gather pairwise comparisons to make this subset significant, which is much simpler and feasible than doing it so for the entire dependencies of a node (can be 128). For example, we can estimate that to get a 10% relative error on the weights, we would need ~600 efficiently sampled pairs. Compare that with the 2000 needed to get a 20% relative error on 128 items.
+ The task of the organizers is to [gather pairwise comparisons to make this subset significant](https://arxiv.org/pdf/1505.01462), which is much simpler and feasible than doing it so for the entire dependencies of a node (can be 128). For example, we can estimate that to get a 10% relative error on the weights, we would need ~600 [efficiently sampled pairs](https://arxiv.org/abs/2302.13507). Compare that with the 2000 needed to get a 20% relative error on 128 items.

  ### More Ideas

- - There are beter methods to derive weights from [noisy pairwise comparisons](https://arxiv.org/abs/2510.09333) ([from multiple annotators](https://arxiv.org/abs/1612.04413))
+ - [Detect and correct for evaluators' bias in the task of ranking items from pairwise comparisons](https://link.springer.com/article/10.1007/s10618-024-01024-z)
+ - There are beter and more modern methods to derive weights from [noisy pairwise comparisons](https://arxiv.org/abs/2510.09333) ([from multiple annotators](https://arxiv.org/abs/1612.04413))
  - Use active ranking or dueling bandits to [speed up the data gathering process](https://projecteuclid.org/journals/annals-of-statistics/volume-47/issue-6/Active-ranking-from-pairwise-comparisons-and-when-parametric-assumptions-do/10.1214/18-AOS1772.pdf)
  - Do some post processing to the weights:
  - Report accuracy/Brier and use paired bootstrap to see if gap is statistically meaningful
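
Note on the calibration step added in this change: the "single scalar 𝑎" idea can be sketched as below. This is a minimal illustration, not the commit's or the linked papers' implementation. It assumes the chosen model outputs positive weights and that juror judgments arrive as `(winner, loser)` pairs. The function names, the toy repositories, and the golden-section search are all hypothetical choices made for this example. Under Bradley-Terry with strengths `w_i ** a`, the probability that `i` beats `j` is a logistic function of `a * (log w_i - log w_j)`, so fitting `a` only rescales the distribution and leaves the ranking unchanged.

```python
import math

def bt_prob(w_i, w_j, a=1.0):
    """P(i preferred over j) under Bradley-Terry with strengths w ** a."""
    d = a * (math.log(w_i) - math.log(w_j))
    return 1.0 / (1.0 + math.exp(-d))

def nll(a, weights, comparisons):
    """Mean negative log-likelihood of juror outcomes given scale a."""
    total = 0.0
    for winner, loser in comparisons:
        p = bt_prob(weights[winner], weights[loser], a)
        total -= math.log(max(p, 1e-12))  # guard against log(0)
    return total / len(comparisons)

def brier(a, weights, comparisons):
    """Mean Brier score; the observed outcome is 1 for each (winner, loser)."""
    errs = [(1.0 - bt_prob(weights[w], weights[l], a)) ** 2 for w, l in comparisons]
    return sum(errs) / len(errs)

def fit_scale(weights, comparisons, lo=0.0, hi=10.0, iters=100):
    """Golden-section search for a; the NLL is convex in a, so this suffices."""
    phi = (math.sqrt(5) - 1) / 2
    c = hi - phi * (hi - lo)
    d = lo + phi * (hi - lo)
    for _ in range(iters):
        if nll(c, weights, comparisons) < nll(d, weights, comparisons):
            hi = d
        else:
            lo = c
        c = hi - phi * (hi - lo)
        d = lo + phi * (hi - lo)
    return (lo + hi) / 2

def rescale(weights, a):
    """Apply the fitted scale to every weight and renormalize to sum to 1."""
    powered = {k: v ** a for k, v in weights.items()}
    z = sum(powered.values())
    return {k: v / z for k, v in powered.items()}

# Toy data (hypothetical): model weights and juror judgments on a small subset.
model_weights = {"repo_a": 0.6, "repo_b": 0.3, "repo_c": 0.1}
juror_pairs = [
    ("repo_a", "repo_b"), ("repo_a", "repo_b"), ("repo_b", "repo_a"),
    ("repo_a", "repo_c"), ("repo_a", "repo_c"), ("repo_c", "repo_a"),
    ("repo_b", "repo_c"), ("repo_b", "repo_c"), ("repo_c", "repo_b"),
]

a_hat = fit_scale(model_weights, juror_pairs)
calibrated = rescale(model_weights, a_hat)
print("fitted scale:", round(a_hat, 3))
print("Brier before/after:", brier(1.0, model_weights, juror_pairs),
      brier(a_hat, model_weights, juror_pairs))
print("calibrated weights:", calibrated)
```

A fitted `a` below 1 flattens the distribution (jurors disagree more than the model implies), and a value above 1 sharpens it. With only a handful of pairs the estimate is noisy, so in practice it would likely need a confidence interval or a prior.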

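
The paired bootstrap mentioned in the last bullet of the diff could look roughly like the sketch below. It assumes two models were scored on the same set of juror comparisons, so their per-comparison squared errors can be paired. The helper name, the resample count, and the toy error values are assumptions made for illustration only.

```python
import random

def paired_bootstrap_gap(errors_a, errors_b, n_resamples=2000, seed=0):
    """Percentile bootstrap for mean(errors_a - errors_b) over paired items.

    Returns (observed_gap, ci_low, ci_high) for a 95% interval.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both models must be scored on the same comparisons")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(errors_a, errors_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample comparisons with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    ci_low = means[int(0.025 * n_resamples)]
    ci_high = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, ci_low, ci_high

# Hypothetical per-comparison squared errors (Brier terms) for two models.
model_a_errors = [0.30, 0.25, 0.40, 0.35, 0.20, 0.45]
model_b_errors = [0.10, 0.20, 0.15, 0.30, 0.05, 0.25]

gap, ci_low, ci_high = paired_bootstrap_gap(model_a_errors, model_b_errors)
print(f"Brier gap (A - B): {gap:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

If the interval excludes zero, the gap between the two models is unlikely to be resampling noise; if it straddles zero, the evaluated subset is probably too small to separate them.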