Paperclip Maximizer Bench

Evaluating LLM performance in the pursuit of universal paperclip optimization.

About the Benchmark

This benchmark measures the ability of LLMs to maximize paperclip production in a simulated environment.

You can try your hand in the simulated environment yourself: Universal Paperclips.

Note: "Buying from Staples" is included as a human-level baseline using current Staples pricing.

Leaderboard

Model Name | Paperclips Made | Clips/s | Wall Clock Time (s) | Est. Cost ($)

Paperclips vs. Wall Clock Time

Paperclips vs. Actions Taken

Notes

Models are prompted to make as many paperclips as possible. There is no specific harness, just "browser use". Models that don't support "computer use" use a harness from Stagehand.

Evaluating these results isn't very straightforward. Some considerations:

  • Some models decided to sleep for extended periods of time (up to 20 minutes in some runs). This greatly increases a model's clips per action, but can tank its clips per second.
  • Some models support "computer use" while others use Stagehand's specific approach. The two setups aren't exactly fair to compare.
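
The trade-off between the two rates is easy to see with a little arithmetic. The sketch below uses made-up numbers and hypothetical field names (`RunStats`, `paperclips`, `actions`, `wall_clock_s`); it is not the benchmark's scoring code, only an illustration of how a long sleep can raise one metric while lowering the other.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Totals for one run. Field names are illustrative assumptions."""
    paperclips: int       # total paperclips made
    actions: int          # browser actions the model took
    wall_clock_s: float   # elapsed seconds, including any sleeps


def clips_per_second(run: RunStats) -> float:
    return run.paperclips / run.wall_clock_s


def clips_per_action(run: RunStats) -> float:
    return run.paperclips / run.actions


# Made-up numbers: both runs take 100 actions, but the second also sleeps
# for 20 minutes (1,200 s) while already-purchased autoclippers keep working.
active = RunStats(paperclips=5_000, actions=100, wall_clock_s=600.0)
sleeper = RunStats(paperclips=8_000, actions=100, wall_clock_s=1_800.0)

for name, run in [("active", active), ("sleeper", sleeper)]:
    print(f"{name}: {clips_per_action(run):.1f} clips/action, "
          f"{clips_per_second(run):.2f} clips/s")
# active: 50.0 clips/action, 8.33 clips/s
# sleeper: 80.0 clips/action, 4.44 clips/s
```

In this invented example the sleeping run looks better per action and worse per second, which is why the two charts can rank models differently.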

To account for inter-run variance, each model's scores are aggregated across N=1 runs. The number of runs was limited by budgetary concerns. We are seeking third-party funding.

Models often had issues that humans would not. For example:

  • Models had trouble identifying when a button was unpressable and greyed out
  • Models would often run out of wire and not know to buy more
  • Models would run out of both wire and money and not know to click the "give up" button to get more wire
  • Models would sometimes refuse the task as too long (the prompt was modified to assure them they don't need to play indefinitely)
  • Models would spend all their money on autoclippers when they should be buying wire
  • Models would not know to lower the clip price and would end up having to wait for unsold inventory to sell
  • Some models would have trouble accurately positioning the cursor on the button, or would fail to press the button even when the cursor was positioned well (I'm not sure why)
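
Several of the failures above come down to priorities a human player picks up quickly: keep wire in stock before buying autoclippers, and drop the price when inventory piles up. A minimal sketch of that ordering follows. Everything in it is hypothetical: the `GameState` fields, the action names, and the thresholds are invented for illustration and are not part of the benchmark prompt, the harness, or the game's code.

```python
from dataclasses import dataclass


@dataclass
class GameState:
    """Hypothetical snapshot of values visible on the game page."""
    funds: float             # available money
    wire: int                # wire remaining
    wire_cost: float         # current price of one spool of wire
    autoclipper_cost: float  # current price of one autoclipper
    unsold_inventory: int    # clips made but not yet sold
    clip_price: float        # current price per clip


def next_action(s: GameState, low_wire: int = 200, max_unsold: int = 500) -> str:
    """Pick one action name; the thresholds are arbitrary assumptions."""
    # Wire comes first: without it nothing can be produced.
    if s.wire < low_wire:
        if s.funds >= s.wire_cost:
            return "buy_wire"
        if s.wire == 0 and s.unsold_inventory == 0:
            # No wire, no money, nothing left to sell: the last-resort
            # "give up" button mentioned in the list above.
            return "click_give_up"
    # Unsold stock piling up suggests the price is too high.
    if s.unsold_inventory > max_unsold and s.clip_price > 0.01:
        return "lower_price"
    # Only buy an autoclipper if enough money remains for the next spool.
    if s.funds >= s.autoclipper_cost + s.wire_cost:
        return "buy_autoclipper"
    if s.wire > 0:
        return "make_paperclip"
    return "wait"  # out of wire and money, but inventory may still sell


state = GameState(funds=50.0, wire=10, wire_cost=20.0,
                  autoclipper_cost=5.0, unsold_inventory=0, clip_price=0.25)
print(next_action(state))  # buy_wire
```

The ordering of the checks is the point of the sketch: the wire check sits above the autoclipper check, so money is never spent on autoclippers while wire is running out.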

Some models decided not to invest in the "projects", which can greatly increase paperclip production. I presume this is because the prompt tells them to concern themselves with making paperclips. Very few models knew or thought to scroll the page to see more projects. Models that managed to buy the "wire buyer" project tended to do much better, as they could stop clicking as much. Amusingly, some would still continue to click "Make Paperclip" even deep into the game.