Bookmarked Elegant and powerful new result that seriously undermines large language models

Gary’s article dropped on Friday and has been widely circulated and commented upon over the weekend.

It shows that LLMs struggle to generalise beyond the direction in which a fact is stated (they know that Tom Cruise’s mum is Mary Lee Pfeiffer but don’t know that Mary Lee Pfeiffer’s son is Tom Cruise - and there are many more examples). This is a known weakness of neural networks that I wrote about in my EACL 2021 paper and that has been documented as far back as the 1990s. What’s interesting is that it still holds today for these massive models with billions of parameters.

For me, the message here isn’t “LLMs aren’t intelligent so let’s write them off as a technology” but rather it’s more evidence that they’re a powerful and yet limited tool in our arsenal and that they’re not a silver bullet. It vindicates and validates approaches that combine technologies to get to the desired output (for example, pairing an LLM with a graph database could help with the mum/son thing).
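
To make the graph database idea a bit more concrete, here is a minimal sketch of why it helps. It is not a real graph database, just a toy in-memory store, and the class name `TinyGraph`, the `INVERSE` table and the relation names `mother_of`/`child_of` are all made up for illustration. The point is that once a fact is stored as an edge, the reverse lookup comes for free, which is exactly the step the LLM misses.

```python
from collections import defaultdict

# Hypothetical table of inverse relations; the names are illustrative only.
INVERSE = {"mother_of": "child_of", "child_of": "mother_of"}


class TinyGraph:
    """Toy stand-in for a graph database (an assumption, not a real product)."""

    def __init__(self):
        # Maps (subject, relation) to the set of objects linked by that relation.
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        # Store the fact in the direction it was stated...
        self._edges[(subject, relation)].add(obj)
        # ...and also store the reverse edge when an inverse relation is known.
        inverse = INVERSE.get(relation)
        if inverse is not None:
            self._edges[(obj, inverse)].add(subject)

    def query(self, subject, relation):
        # Return matches as a sorted list; unknown facts give an empty list.
        return sorted(self._edges.get((subject, relation), set()))


graph = TinyGraph()
graph.add("Mary Lee Pfeiffer", "mother_of", "Tom Cruise")

print(graph.query("Mary Lee Pfeiffer", "mother_of"))  # forward lookup
print(graph.query("Tom Cruise", "child_of"))          # reverse lookup
```

In a combined setup the LLM would handle the language side (working out who and what the question is about) and hand the lookup to the graph, rather than being trusted to recall the fact in both directions itself.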

For me this is also a stake in the heart for the whole "there’s the spark of general intelligence there" argument. I find these kinds of probing/diagnostic tests on models really interesting.