Small Large Language Model might sound like a bit of an oxymoron. However, I think it perfectly describes the class of LLMs in the 1-10 billion parameter range, like Llama 3 and Phi 3. In the last few days, Meta and Microsoft have both released these open(ish) models, which can happily run on normal hardware. Both models perform surprisingly well for their size, competing with much larger models like GPT-3.5 and Mixtral. However, how well do they generalise to new, unseen tasks? Can they do biology?

Introducing Llama and Phi

Meta's offering, Llama 3 8B, is an 8 billion parameter model that can be run on a modern laptop. It performs almost as well as the Mixtral 8x22B mixture-of-experts model, which is many times bigger and far more compute intensive.

Microsoft's model, Phi 3 mini, is around half the size of Llama 3 8B at 3.8 billion parameters. It is small enough to run on a high-end smartphone at a reasonable speed. Incredibly, Phi actually beats Llama 3 8B, a model twice its size, on a few popular benchmarks, including MMLU, which approximately measures "how well does this model behave as a chatbot?", and HumanEval, which measures "how well can this model write code?".

I've also read a lot of anecdotal evidence from people chatting to these models and finding them quite engaging and useful chat partners (as opposed to previous-generation small models). This seems to back up the benchmark performance and provides some validation of the models' utility.

Both Microsoft and Meta have stated that the key difference between these models and previous iterations of their smaller LLMs is the training regime. Interestingly, the two companies applied very different training strategies. Meta trained Llama 3 on over 15 trillion tokens (roughly, words), which is an unusually large corpus for a small model. Microsoft trained Phi on much smaller training sets curated for high quality.

Can Phi, Llama and other Small Models Do Biology?

Having a model small enough to run on your phone and generate funny poems or trivia questions is neat. However, for AI and NLP practitioners, a more interesting question is "do these models generalise well to new, unseen problems?"

I set out to gather some data about how well Phi and Llama 3 8B generalise to a less well-known task. As it happens, I have recently been working with my friend Dan Duma on a test harness for BioASQ Task B, a niche NLP task in the biomedical space. The model is fed a series of snippets from scientific papers along with a question which it must answer correctly. There are four different formats of question, which I'll explain below.

The 11th BioASQ Task B leaderboard is somewhat dominated by GPT-4 entrants with perfect scores at some of the sub-tasks. If you were somewhat cynical, you might consider this task "solved". However, we think it's an interesting arena for testing how well smaller models are catching up to big commercial offerings.

BioASQ B is primarily a reading comprehension task with a slightly niche subject-matter. The models under evaluation are unlikely to have been explicitly trained to answer questions about this material. Smaller models are often quite effective at these sorts of RAG-style problems since they do not need to have internalised lots of facts and general information. In fact, in their technical report, the authors of Phi-3 mini call out the fact that their model can't retain factual information but could be augmented with search to produce reasonable results. This seemed like a perfect opportunity to test it out.

How The Task Works

There are four types of question in Task B: Factoid, Yes/No, List and Summary. However, since summary answers are quite tricky to measure, they are not part of the BioASQ leaderboard, and we also chose to omit them from our tests.

Each question is provided along with a set of snippets: full sentences or paragraphs that have been pre-extracted from scientific papers. Incidentally, that extraction is BioASQ Task A, which requires a lot more moving parts since there's retrieval involved too. In Task B we are concerned only with existing sets of snippets and questions.

In each case the model is required to respond with a short and precise exact answer to the question. The model may optionally also provide an ideal answer, which gives some rationale for the exact answer. The ideal answer may provide useful context for the user but is not formally evaluated as part of BioASQ.

Yes/No questions require an exact answer of just "yes" or "no". For List questions, we are looking for a list of named entities (for example, symptoms or types of microbe). For Factoid questions we are typically looking for a single named entity. Models are allowed to respond to factoids with multiple answers, so factoid answers are scored by how close to the top of the list the "correct" answer is ranked.

The figure from the Hseuh et al 2023 paper below illustrates this quite well:

Examples of different question types. Full transcriptions of each are:

Yes/No
Question: Proteomic analyses need prior knowledge of the organism complete genome. Is the complete genome of the bacteria of the genus Arthrobacter available?
Exact Answer: yes
Ideal Answer: Yes, the complete genome sequence of Arthrobacter (two strains) is deposited in GenBank.

List
Question: List Hemolytic Uremic Syndrome Triad.
Exact Answer: [anaemia, thrombocytopenia, renal failure]
Ideal Answer: Hemolytic uremic syndrome (HUS) is a clinical syndrome characterized by the triad of anaemia, thrombocytopenia, renal failure.

Factoid
Question: What enzyme is inhibited by Opicapone?
Exact Answer: [catechol-O-methyltransferase]
Ideal Answer: Opicapone is a novel catechol-O-methyltransferase (COMT) inhibitor to be used as adjunctive therapy in levodopa-treated patients with Parkinson's disease

Summary
Question: What kind of affinity purification would you use in order to isolate soluble lysosomal proteins?
Ideal Answer: The rationale for purification of the soluble lysosomal proteins resides in their characteristic sugar, the mannose-6-phosphate (M6P), which allows an easy purification by affinity chromatography on immobilized M6P receptors.

(Figure 1 from the Hseuh et al 2023 paper.)

Our Setup

We wrote a Python script that passes the question, context and guidance about the type of question to the model. We used a patched version of Ollama that allowed us to put restrictions on the shape of the model output, ensuring responses were valid JSON with the same shape and structure as the BioASQ examples. These forced grammars saved us loads of time trying to coax JSON out of the models in the structure we wanted, something that smaller models aren't great at. Even so, models would sometimes fail to give valid responses, for example getting stuck in infinite loops spitting out brackets or newlines. We gave each model up to 3 chances to produce a valid JSON response before a question was marked unanswerable and skipped.
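For illustration, here's a heavily simplified sketch of what that question-answering loop looks like. It uses stock Ollama's JSON mode rather than the stricter grammar constraints from our patched build, and the retry/skip logic is the same idea in miniature:

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MAX_ATTEMPTS = 3

def ask_model(model: str, prompt: str) -> dict | None:
    """Ask the model for a JSON answer, retrying a few times before giving up."""
    for _ in range(MAX_ATTEMPTS):
        response = requests.post(OLLAMA_URL, json={
            "model": model,
            "prompt": prompt,
            "format": "json",  # stock JSON mode; our patched build enforced a full grammar
            "stream": False,
        })
        try:
            return json.loads(response.json()["response"])
        except (json.JSONDecodeError, KeyError):
            continue  # malformed output (e.g. runaway brackets), try again
    return None  # question gets marked unanswerable and skipped
```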

Prompts

We used exactly the same prompts for all of the models, which may have left room for further performance improvements. The exact prompts and grammar constraints that we used can be found here. Snippets are concatenated together, separated by newlines, and provided as the "context" in the prompt template.
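The prompt construction itself is nothing fancy. Roughly speaking, it amounts to something like this (the wording of the template here is illustrative rather than the exact one we used):

```python
def build_prompt(question: str, snippets: list[str], question_type: str) -> str:
    """Join the pre-extracted snippets into a context block and wrap them
    in a simple instruction template."""
    context = "\n".join(snippets)  # snippets concatenated with newlines
    return (
        f"Context:\n{context}\n\n"
        f"Question ({question_type}): {question}\n"
        'Respond with JSON containing an "exact_answer" and an "ideal_answer".'
    )
```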

We used the official BioASQ scoring tool to evaluate the responses and produce the results below. We evaluated our pipeline on the Task 11B Golden Enriched test set. You have to create a free account at BioASQ to log in and download the data.

Models

We compared quantized versions of Phi and Llama 3 with some other similarly sized models which perform well at benchmarks: Mistral, Starling-LM and Zephyr.

Note that although Phi is approximately half the size of the other models, its authors report competitive results against much larger models on a number of widely used benchmarks, so it seems reasonable to compare it with these 7B and 8B models as opposed to only benchmarking against other 4B-and-smaller models.

Results

Yes/No Questions

The simplest type of BioASQ question is Yes/No. These results are measured with macro F1, which gives us a single metric covering performance at both "yes" and "no" questions.
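Macro F1 just means computing an F1 score for the "yes" class and the "no" class separately and averaging the two, for example (with made-up labels):

```python
from sklearn.metrics import f1_score

gold = ["yes", "no", "yes", "yes", "no"]  # made-up gold labels
pred = ["yes", "no", "no", "yes", "no"]   # made-up model answers

print(f1_score(gold, pred, average="macro"))  # mean of the per-class F1 scores
```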

Yes/No macro F1 (chart values, averaged over 4 sets of results; the range indicators on the bars show the spread):

Llama 3: 1.0
Mistral: 0.8
Phi 3: 0.7
Starling: 0.9
Zephyr: 0.85
The results show that all 5 models perform reasonably well at this task, but Phi 3 lags behind the others a little, by only about 10% compared to its closest competitor. The best solutions to this task are coming in at 1.0 F1, and Llama 3 and Starling both achieve pretty close to perfect results here.

Factoid Questions

For factoid questions we measure responses with MRR (mean reciprocal rank), since the model can return multiple possible answers and we are interested in how close the right answer is to the top of the list.
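Concretely, the reciprocal rank for a question is 1 divided by the position of the first correct answer in the model's list (0 if it never appears), and MRR is the average of that over all questions. A rough sketch, using naive lowercase string matching rather than BioASQ's official matching rules:

```python
def mean_reciprocal_rank(ranked_answers: list[list[str]], gold_answers: list[str]) -> float:
    """Average of 1/rank of the first correct answer per question (0 if absent)."""
    total = 0.0
    for candidates, gold in zip(ranked_answers, gold_answers):
        for rank, answer in enumerate(candidates, start=1):
            if answer.lower() == gold.lower():
                total += 1.0 / rank
                break
    return total / len(gold_answers)
```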

Factoid MRR (chart values, averaged over 4 sets of results; the range indicators on the bars show the spread):

Llama 3: roughly 0.55
Mistral: roughly 0.05
Phi 3: roughly 0.15
Starling: roughly 0.17
Zephyr: roughly 0.12
This graph is a lot starker than the yes/no graph. Llama 3 outperforms its next closest neighbour by a significant margin (roughly +0.40 MRR). The best solution to this task, again a GPT-4-based entrant, weighs in at 0.6316 MRR, so it's pretty impressive that Llama 3 8B is producing results in the same ballpark as a model many times larger. For this one, Phi is in third place, after Starling-LM 7B. Again, given that Phi is half the size of that model, its performance is quite impressive.

List Questions

We measure list questions with F1. A false positive is when something irrelevant is included in the answer; a false negative is when something relevant is missed from the answer. F1 gives us a single statistic that balances the two.
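In code, a simplified version of that scoring (again using naive string matching rather than the official evaluator's rules) looks something like this:

```python
def list_f1(predicted: list[str], gold: list[str]) -> float:
    """Harmonic mean of precision and recall over predicted vs gold entities."""
    pred_set = {p.lower() for p in predicted}
    gold_set = {g.lower() for g in gold}
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)  # penalises irrelevant extras
    recall = true_positives / len(gold_set)     # penalises missed entities
    return 2 * precision * recall / (precision + recall)
```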

List F1 (chart values, averaged over 4 sets of results; the range indicators on the bars show the spread):

Llama 3: roughly 0.45
Mistral: roughly 0.21
Phi 3: roughly 0.05
Starling: roughly 0.27
Zephyr: roughly 0.32
This one was a little surprising to me, as Phi does a lot worse than any of its counterparts. We noticed that Phi produced a much higher rate of unanswerable questions than any of the other models, which may be due to the somewhat complex JSON structure required by list-type questions. It may be worth re-testing with different formatting arrangements to see whether the formatting failures are masking reasonable performance at the underlying task.

Llama 3 8B wins again. The current best solution, again a GPT-4-based system, achieves an F1 of 0.72, so even Llama 3 8B leaves a relatively wide gap here. It would be worth testing the larger variants of Llama 3 to see how well they perform at this task and whether they are competitive with GPT-4.

Discussion and Conclusion

Llama 3

We've seen that Llama 3 8B and, to a lesser extent, Phi 3 Mini are able to generalise reasonably well to a reading comprehension task in a field that wasn't a primary concern for either set of model authors. This isn't conclusive evidence for or against the general performance of these models on unseen tasks. However, it is certainly an interesting data point showing that Llama 3 in particular really is competitive with much larger models at this task. I wonder if that's because it was trained on such a large corpus, which may have included some biomedical content.

Phi

I'm reluctant to critique Phi's reasoning and reading comprehension ability too harshly, since there's a good chance that it was disadvantaged by our test setup and the forced JSON structure, particularly for the list task. However, the weaker performance at the yes/no questions may be a hint that it isn't quite as good at generalised reading comprehension as the competing larger models.

We know that Phi 3, like its predecessor, was trained on data that "consists of heavily filtered web data (according to the “educational level”) from various open internet sources, as well as synthetic LLM-generated data." However, we don't know specifically what was included or excluded. If Llama 3 went for a "cast the net wide" approach to data collection, it's likely that it was exposed to more biomedical content "by chance" and is thus better at reasoning about concepts that Phi has perhaps never seen before.

I do want to again call out that Phi is approximately half the size of the next biggest model in our benchmark, so its performance is quite impressive in that light.

Further Experiments

Model Size

I won't conjecture about whether 3.8B parameters is "too small" to generalise, given the issues mentioned above, but I'd love to see more tests of this in future. Do the larger variants of Phi (trained on the same data but simply with more parameters) suffer from the same issues?

Model Fine Tuning

The models that I've been testing are small enough that they can be fine-tuned on specific problems on a consumer-grade gaming GPU for very little cost. It seems entirely plausible to me that by fine-tuning these models on biomedical text and historical BioASQ training sets, their performance could be improved even further. The challenge would be in finding the right mix of data.
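One obvious low-cost approach would be a LoRA-style adapter with Hugging Face's peft library. Below is a sketch of the adapter setup only (the data preparation and training loop are omitted, and the target module names are the usual ones for Llama-style architectures rather than something I've verified for every model here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumes access to the gated repo
tokenizer = AutoTokenizer.from_pretrained(model_name)  # needed to tokenise BioASQ training examples
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, typical for Llama-style models
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the weights need training
```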

Better Prompts

We did not spend a lot of time attempting to build effective prompts for this experiment, so some performance may have been left on the table. Smaller models are often quite fussy about prompts. It might be interesting to use a prompt optimisation framework like DSPy to search for better prompts more systematically.
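As a flavour of what that might look like, DSPy lets you declare the task as a signature and then optimise prompts and few-shot demonstrations against a metric. A minimal, untested sketch (the field names and docstring here are my own placeholders, not our actual prompts):

```python
import dspy

class BioASQAnswer(dspy.Signature):
    """Answer a biomedical question using only the provided snippets."""
    context = dspy.InputField(desc="snippets extracted from scientific papers")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a short, exact answer")

qa = dspy.ChainOfThought(BioASQAnswer)
# An optimiser such as dspy.teleprompt.BootstrapFewShot could then search for better
# demonstrations using historical BioASQ training questions and a scoring metric.
```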

Other Tasks

I tried these models on BioASQ, but this is light-years away from conclusive evidence for whether or not these new-generation small models can generalise well; it's simply a test of whether they can do biology. It will be very interesting to try other novel tasks and see how well they perform. Watch this space!