We compared Llama 2 7b, 13b and 70b (chat-hf fine-tuned) against OpenAI gpt-3.5-turbo and gpt-4. We used a 3-way verified, hand-labeled set of 373 news report statements and presented one correct and one incorrect summary of each. Each LLM had to decide which of the two was the factually correct summary.
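To make the setup concrete, here is a minimal sketch of how such a two-choice evaluation could be scored. Everything in it is an assumption for illustration: the `Item` fields, the `choose` callback (a stand-in for whatever call asks a model to pick between two summaries), and the shuffling of option order, which the post does not describe. It is not the harness used for the results.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Item:
    """One hand-labeled example (hypothetical field names)."""
    statement: str   # the news report statement
    correct: str     # the factually correct summary
    incorrect: str   # the factually incorrect summary


# Hypothetical callback: given the statement and two candidate summaries,
# return 0 if the first is judged correct, 1 if the second is.
Chooser = Callable[[str, str, str], int]


def pairwise_accuracy(items: Sequence[Item], choose: Chooser, seed: int = 0) -> float:
    """Fraction of items where `choose` picks the correct summary."""
    if not items:
        raise ValueError("items must not be empty")
    rng = random.Random(seed)
    hits = 0
    for item in items:
        options = [item.correct, item.incorrect]
        # Assumption: randomize the order so a model that always answers
        # "first" or "second" cannot score well by position alone.
        rng.shuffle(options)
        picked = choose(item.statement, options[0], options[1])
        if options[picked] == item.correct:
            hits += 1
    return hits / len(items)
```

In practice `choose` would wrap a prompt to one of the models above and parse its answer; a model guessing at random should land near 0.5 on this metric.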
https://link.medium.com/ugIcBrTXxCb