SAST Implementation for Evaluating LLM-Generated Code Quality using Prompt Engineering

Muhammad Luthfi Abdillah, Tikaridha Hardiani

Abstract


Large Language Models (LLMs) are increasingly used to generate programming code, but the quality of the output depends heavily on the prompts provided. This study evaluates how prompt engineering techniques influence the non-functional quality of LLM-generated code. The research used a quantitative experimental approach covering five Python game development tasks and four prompt variations: zero-shot, few-shot, chain-of-thought, and role-based prompting. A total of 200 code snippets were analyzed using Static Application Security Testing (SAST) with the DeepSource tool to detect issues across seven categories: secrets, bug risk, anti-pattern, security, performance, style, and documentation. The results indicate that few-shot prompting produced the fewest issues overall (1,328 of the 6,932 issues detected), with particular advantages in the anti-pattern and performance categories. However, few-shot prompting also recorded more critical issues (3) than zero-shot and role-based prompting (1 each), indicating a trade-off between the total volume of issues and their severity. Role-based prompting generated the most issues (2,516), particularly in the style and documentation categories. This study recommends few-shot prompting as a foundational approach for AI-assisted software development and highlights the importance of integrating SAST into CI/CD pipelines to ensure code security and quality.
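The figures reported above can be put on a common scale with a little arithmetic. The following is a minimal sketch, not the authors' analysis script. It uses only the counts stated in the abstract; the variable and function names are illustrative, and the even split of snippets across task and prompt combinations is an assumption, since the abstract does not state how the 200 snippets were distributed.

```python
# Counts as reported in the abstract.
TOTAL_ISSUES = 6932
REPORTED_ISSUES = {"few-shot": 1328, "role-based": 2516}
PROMPT_TECHNIQUES = ["zero-shot", "few-shot", "chain-of-thought", "role-based"]
NUM_TASKS = 5
NUM_SNIPPETS = 200


def share(count, total=TOTAL_ISSUES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Assumption: a balanced design, i.e. the same number of snippets for
# every (task, prompt technique) pair. The abstract does not confirm this.
snippets_per_cell = NUM_SNIPPETS // (NUM_TASKS * len(PROMPT_TECHNIQUES))

# Share of all detected issues attributed to each technique named in the abstract.
shares = {name: share(count) for name, count in REPORTED_ISSUES.items()}

# The abstract gives no separate totals for zero-shot and chain-of-thought,
# so only their combined count can be derived.
remaining_issues = TOTAL_ISSUES - sum(REPORTED_ISSUES.values())

# Baseline for comparison: the share each technique would have if issues
# were spread evenly across the four techniques.
even_share = 100 / len(PROMPT_TECHNIQUES)

print(f"snippets per task/prompt pair (assumed balanced): {snippets_per_cell}")
for name, pct in shares.items():
    print(f"{name}: {pct}% of all issues (even split would be {even_share}%)")
print(f"zero-shot + chain-of-thought combined: {remaining_issues} issues")
```

Under these assumptions, few-shot prompting accounts for roughly a fifth of all detected issues and role-based prompting for more than a third, against an even-split baseline of 25% per technique.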

Keywords


code security; code quality; large language models; prompt engineering; static application security testing



DOI: https://doi.org/10.32520/stmsi.v15i5.6395


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.