Leaderboard (KGQA)

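| Category | Model | F1 | EM | Executability |
|---|---|---|---|---|
| API-based Model | GPT-4 | 81.8 | 79.2 | 88.4 |
| API-based Model | GLM-4 | 72.4 | 70.4 | 82.7 |
| Open-Sourced Model (Large) | Qwen1.5-72B-Chat | 74.7 | 72.2 | 96.1 |
| Open-Sourced Model (Large) | Llama-3-70B-Instruct | 80.7 | 77.8 | 97.0 |
| Open-Sourced Model (Large) | DeepSeek-LLM-67B-Chat | 69.6 | 66.8 | 86.3 |
| Open-Sourced Model (Medium) | Qwen1.5-32B-Chat | 64.6 | 62.1 | 83.0 |
| Open-Sourced Model (Medium) | Qwen1.5-14B-Chat | 66.0 | 61.6 | 78.7 |
| Open-Sourced Model (Medium) | Baichuan2-13B-Chat | 43.7 | 42.0 | 82.2 |
| Open-Sourced Model (Small) | Llama-3-8B-Instruct | 54.7 | 51.3 | 84.8 |
| Open-Sourced Model (Small) | Qwen1.5-7B-chat | 44.5 | 40.3 | 77.9 |
| Open-Sourced Model (MoE) | Mixtral-8x7B-Instruct-v0.1 | 70.1 | 67.9 | 84.7 |
| Open-Sourced Model (MoE) | Starling-LM-alpha-8x7B-MoE-GPTQ | 12.4 | 10.9 | 30.7 |
| Open-Sourced Model (MoE) | Qwen1.5-MoE-A2.7B-Chat | 28.7 | 26.7 | 71.9 |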

Leaderboard (SCV)

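| Category | Model | Accuracy | Right Quotes | Error |
|---|---|---|---|---|
| API-based Model | GPT-4 | 83.9 | 87.7 | 0.4 |
| API-based Model | GLM-4 | 86.9 | 86.5 | 0.6 |
| Open-Sourced Model (Large) | Qwen1.5-72B-Chat | 85.7 | 83.3 | 0.1 |
| Open-Sourced Model (Large) | Llama-3-70B-Instruct | 85.9 | 86.6 | 0.2 |
| Open-Sourced Model (Large) | DeepSeek-LLM-67B-Chat | 76.6 | 82.6 | 0.4 |
| Open-Sourced Model (Medium) | Qwen1.5-32B-Chat | 79.7 | 83.0 | 0.4 |
| Open-Sourced Model (Medium) | Qwen1.5-14B-Chat | 66.1 | 67.4 | 0.2 |
| Open-Sourced Model (Medium) | Baichuan2-13B-Chat | 26.3 | 35.8 | 33.6 |
| Open-Sourced Model (Small) | Llama-3-8B-Instruct | 78.5 | 83.3 | 0.5 |
| Open-Sourced Model (Small) | Qwen1.5-7B-chat | 72.5 | 39.1 | 2.2 |
| Open-Sourced Model (MoE) | Mixtral-8x7B-Instruct-v0.1 | 77.8 | 82.5 | 2.3 |
| Open-Sourced Model (MoE) | Starling-LM-alpha-8x7B-MoE-GPTQ | 55.0 | 56.2 | 0.1 |
| Open-Sourced Model (MoE) | Qwen1.5-MoE-A2.7B-Chat | 55.0 | 57.8 | 3.0 |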

Leaderboard (Agent Task)

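| Task type | Model | KG query task: tool selection | KG query task: executability | Validation task: tool selection | Validation task: executability | Final result: Exact Match | Final result: executability |
|---|---|---|---|---|---|---|---|
| web database KG check | GPT-4 | 65.00 | 65.00 | 88.10 | 88.80 | 64.50 | 96.90 |
| web database KG check | Llama-3-70B-Instruct | 96.90 | 96.90 | 97.50 | 97.50 | 38.10 | 100.00 |
| publication database KG check | GPT-4 | 67.70 | 67.70 | 69.20 | 69.20 | 61.50 | 95.40 |
| publication database KG check | Llama-3-70B-Instruct | 100.00 | 100.00 | 95.40 | 95.40 | 41.50 | 63.10 |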