Evaluation of the Pedagogical Performance of Generative Artificial Intelligence Systems: Comparative Study Between chatgpt-4.0 and Deepseek-R1 in the Learning of Mathematics in the Moroccan Baccalaureate
- DOI
- 10.2991/978-94-6239-634-0_15
- Keywords
- Generative Artificial Intelligence; ChatGPT-4.0; DeepSeek-R1; pedagogical performance; mathematics education; baccalaureate exam
- Abstract
Rapid advances in generative AI are revealing a disruptive potential for educational practice, raising new challenges while opening unprecedented prospects for innovation, particularly in scientific disciplines such as mathematics. This calls for an in-depth analysis of the pedagogical performance and limitations of this emerging technology. This study evaluates and compares the pedagogical performance of ChatGPT-4.0 and DeepSeek-R1 across several cognitive levels involved in learning mathematics at the Moroccan baccalaureate level. The corpus comprises 120 mathematical questions drawn from 11 monothematic exercises and three problems from the national scientific baccalaureate exams. The exercises were administered systematically to both models under a standardised protocol designed to ensure the objectivity and reliability of the results. The answers were then collected, marked using an evaluation grid, and compared with an official answer key drawn up by the French Ministry of Education. Finally, a comparative statistical analysis examined the pedagogical performance of each model. The results highlight the superiority of ChatGPT-4.0, with an overall score of 16.12/20 (80.6%), compared with DeepSeek-R1, which obtained an overall score of 12.80/20 (64.8%). DeepSeek-R1 nevertheless came close to ChatGPT-4.0 on the monothematic exercises, scoring 15.95/20 (79.75%) against 17.18/20 (85.9%) for ChatGPT-4.0. Both models have limitations, however, including factual inaccuracies and a lack of contextual accuracy, which are particularly apparent in problems requiring deep mathematical reasoning. This study provides relevant data on the pedagogical performance of two generative AI models in secondary school mathematics education.
It thus contributes to the discussion of the challenges associated with learning and the integrity of assessments, particularly in the context of online teaching.
- Copyright
- © 2026 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY - CONF
AU - Omar Oulad Ebrahim
AU - Abderrahim El Mhouti
AU - Mostafa Allaoui
PY - 2026
DA - 2026/04/02
TI - Evaluation of the Pedagogical Performance of Generative Artificial Intelligence Systems: Comparative Study Between chatgpt-4.0 and Deepseek-R1 in the Learning of Mathematics in the Moroccan Baccalaureate
BT - Proceedings of the E-Learning and Smart Engineering Systems (ELSES 2025)
PB - Atlantis Press
SP - 186
EP - 196
SN - 2667-128X
UR - https://doi.org/10.2991/978-94-6239-634-0_15
DO - 10.2991/978-94-6239-634-0_15
ID - Ebrahim2026
ER -