2025

Evaluation of the Code Generated By Large Language Models: The State of the Art

Ying, Zhihao; Towey, Dave; Zhang, Yifan

The rapid development of Large Language Models (LLMs), such as ChatGPT and DeepSeek, has revolutionized software development, particularly in the domain of automated code generation. These models, built on architectures like the Transformer, have demonstrated remarkable capabilities in generating human-like text and source code, significantly enhancing developer productivity and reducing development time. However, the widespread adoption of LLMs for code generation raises concerns regarding the reliability, quality, and potential risks associated with the generated code. This article reviews and analyzes the state of the art in evaluating LLM-generated code, summarizing research findings and application areas. It highlights the challenges in distinguishing between machine-generated and human-written code, as well as the potential for LLMs to introduce security vulnerabilities and maintainability issues. We discuss the implications of these findings for both researchers and practitioners, emphasizing the need for continued research in the evaluation of LLM-generated code. Finally, we identify gaps in the literature and propose future research directions, such as the development of more robust benchmarks and improved evaluation metrics. By providing a thorough overview of the current landscape, this paper offers a resource for researchers and practitioners interested in the code generation capabilities and limitations of LLMs. We also highlight the importance of ongoing evaluation and refinement of these models to ensure their safe and effective integration into software-development practices.

Computer science; Programming language; Code (set theory); State (computer science); Natural language processing
