| Model | F1 | EM | Executability |
|---|---|---|---|
| **API-based Model** | | | |
| GPT-4 | 81.8 | 79.2 | 88.4 |
| GLM-4 | 72.4 | 70.4 | 82.7 |
| **Open-Sourced Model (Large)** | | | |
| Qwen1.5-72B-Chat | 74.7 | 72.2 | 96.1 |
| Llama-3-70B-Instruct | 80.7 | 77.8 | 97.0 |
| DeepSeek-LLM-67B-Chat | 69.6 | 66.8 | 86.3 |
| **Open-Sourced Model (Medium)** | | | |
| Qwen1.5-32B-Chat | 64.6 | 62.1 | 83.0 |
| Qwen1.5-14B-Chat | 66.0 | 61.6 | 78.7 |
| Baichuan2-13B-Chat | 43.7 | 42.0 | 82.2 |
| **Open-Sourced Model (Small)** | | | |
| Llama-3-8B-Instruct | 54.7 | 51.3 | 84.8 |
| Qwen1.5-7B-Chat | 44.5 | 40.3 | 77.9 |
| **Open-Sourced Model (MoE)** | | | |
| Mixtral-8x7B-Instruct-v0.1 | 70.1 | 67.9 | 84.7 |
| Starling-LM-alpha-8x7B-MoE-GPTQ | 12.4 | 10.9 | 30.7 |
| Qwen1.5-MoE-A2.7B-Chat | 28.7 | 26.7 | 71.9 |
| Model | Accuracy | Right Quotes | Error |
|---|---|---|---|
| **API-based Model** | | | |
| GPT-4 | 83.9 | 87.7 | 0.4 |
| GLM-4 | 86.9 | 86.5 | 0.6 |
| **Open-Sourced Model (Large)** | | | |
| Qwen1.5-72B-Chat | 85.7 | 83.3 | 0.1 |
| Llama-3-70B-Instruct | 85.9 | 86.6 | 0.2 |
| DeepSeek-LLM-67B-Chat | 76.6 | 82.6 | 0.4 |
| **Open-Sourced Model (Medium)** | | | |
| Qwen1.5-32B-Chat | 79.7 | 83.0 | 0.4 |
| Qwen1.5-14B-Chat | 66.1 | 67.4 | 0.2 |
| Baichuan2-13B-Chat | 26.3 | 35.8 | 33.6 |
| **Open-Sourced Model (Small)** | | | |
| Llama-3-8B-Instruct | 78.5 | 83.3 | 0.5 |
| Qwen1.5-7B-Chat | 72.5 | 39.1 | 2.2 |
| **Open-Sourced Model (MoE)** | | | |
| Mixtral-8x7B-Instruct-v0.1 | 77.8 | 82.5 | 2.3 |
| Starling-LM-alpha-8x7B-MoE-GPTQ | 55.0 | 56.2 | 0.1 |
| Qwen1.5-MoE-A2.7B-Chat | 55.0 | 57.8 | 3.0 |
| Model | KG Query Task: Tool Selection | KG Query Task: Executability | Validation Task: Tool Selection | Validation Task: Executability | Final Result: Exact Match | Final Result: Executability |
|---|---|---|---|---|---|---|
| **Task type: web database KG check** | | | | | | |
| GPT-4 | 65.00 | 65.00 | 88.10 | 88.80 | 64.50 | 96.90 |
| Llama-3-70B-Instruct | 96.90 | 96.90 | 97.50 | 97.50 | 38.10 | 100.00 |
| **Task type: publication database KG check** | | | | | | |
| GPT-4 | 67.70 | 67.70 | 69.20 | 69.20 | 61.50 | 95.40 |
| Llama-3-70B-Instruct | 100.00 | 100.00 | 95.40 | 95.40 | 41.50 | 63.10 |