As training regimes go, being forced to watch eight years' worth of someone else playing Minecraft feels pretty harsh. When the revolution comes I fear OpenAI could be first against the wall come the robot uprising, after what it's put its latest AI through in order to get it to play the standard version of Minecraft.
I mean, OpenAI already created DALL-E, and is therefore responsible for the DALL-E mini homage by association. That's the famed AI now forced to create memes 24/7 for the internet at large.
For the historical record, I for one welcome our digital overlords and have never kicked a DARPA BigDog across a parking lot, jumped in front of an autonomous car repeatedly, or bugged an AI assistant incessantly for larks. I'm on your side, bots.
With all that being said, I still think the fact the OpenAI neural network can now craft a diamond pickaxe off its own bat is actually pretty darned incredible.
The detailed blog post on the OpenAI site (via SingularityHub) explains how it managed to teach the network to play Minecraft, and it's some fascinating stuff. Not least the fact that, alongside roughly 70,000 hours of Minecraft gameplay footage, OpenAI paid a team of contractors $160,000 to create and label 2,000 hours of footage so the AI could understand what it was looking at and how that related to its actions in the game.
The method is called Video PreTraining (VPT), and OpenAI claims the resulting model can learn to craft diamond tools, a task it says takes a proficient human around 20 minutes.
"Additionally, the model performs other complex skills humans often do in the game," states the OpenAI post, "such as swimming, hunting animals for food, and eating that food. It also learned the skill of 'pillar jumping', a common behavior in Minecraft of elevating yourself by repeatedly jumping and placing a block underneath yourself."
It's also worth noting that the AI uses the standard mouse and keyboard interface, not some special AI-focused build of the game.
If it were just watching the videos without context, training a neural network would be extremely challenging, which is why OpenAI retained a pool of contractors to create a smaller dataset in which they recorded both their video and the actions they took: keypresses and mouse movements. Those recorded actions serve as labels for the 2,000 hours of contractor footage, which is then used to train something called an Inverse Dynamics Model (IDM); the trained IDM can then go off and tag the larger 70,000-hour dataset accurately.
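The pipeline works out to: learn action labels from the small contractor dataset, then use that model to pseudo-label the big unlabelled corpus. Here's a minimal toy sketch of that flow; all the names, the stand-in "frames", and the lookup-table "model" are hypothetical simplifications (OpenAI's actual IDM is a large neural network operating on raw video):

```python
# Toy sketch of the VPT labelling pipeline (hypothetical names/logic):
# 1. Contractors produce (frames, actions) recordings.
# 2. An "IDM" learns which action happens between consecutive frames.
# 3. The trained IDM pseudo-labels unlabelled video, producing
#    behaviour-cloning data at scale.

from collections import Counter

def train_idm(contractor_data):
    """Stand-in for IDM training: memorise the most common action
    observed for each frame transition in the contractor dataset."""
    votes = {}
    for frames, actions in contractor_data:
        for (f0, f1), a in zip(zip(frames, frames[1:]), actions):
            votes.setdefault((f0, f1), Counter())[a] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in votes.items()}

def pseudo_label(idm, frames, default="noop"):
    """Label a raw video by predicting the action taken between each
    pair of consecutive frames."""
    return [idm.get((f0, f1), default) for f0, f1 in zip(frames, frames[1:])]

# Tiny made-up contractor recording: frames are stand-in tokens.
contractor_data = [
    (["tree", "tree_close", "log"], ["walk_forward", "attack"]),
]
idm = train_idm(contractor_data)

# Unlabelled "internet video" gets pseudo-labelled by the IDM.
labels = pseudo_label(idm, ["tree", "tree_close", "log", "water"])
print(labels)  # → ['walk_forward', 'attack', 'noop']
```

The design point is that human action-labelling only has to cover the small dataset; the IDM amortises that cost across the whole 70,000 hours.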
It's this tagged video content that is seemingly the key to training such complex and open behaviours as you'll find in Minecraft. VPT has now kinda been proven as an approach, and its future as a training method means, as OpenAI states, that it "paves the path toward allowing agents to learn to act by watching the vast numbers of videos on the internet."
Though whether that's something to wonder at or to fear, I'm still not sure. I've seen YouTube; there's a lot of terrible stuff on there. There's a lot of me on there, for god's sake.