Graphical Abstract Figure

Abstract

This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data—including textual design requirements, CAD images, and engineering drawings—derived from the Formula SAE student competition. Unlike many existing MLLM benchmarks, DesignQA contains document-grounded visual questions in which the input image and the input document come from different sources. The benchmark features automatic evaluation metrics and is divided into segments—Rule Comprehension, Rule Compliance, and Rule Extraction—based on tasks that engineers perform when designing according to requirements. We evaluate state-of-the-art models (at the time of writing) such as GPT-4o, GPT-4, Claude-Opus, Gemini-1.0, and LLaVA-1.5 against the benchmark, and our study uncovers existing gaps in MLLMs' abilities to interpret complex engineering documentation. The MLLMs tested, while promising, struggle to reliably retrieve relevant rules from the Formula SAE documentation, face challenges in recognizing technical components in CAD images, and encounter difficulty in analyzing engineering drawings. These findings underscore the need for multimodal models that can better handle the multifaceted questions characteristic of design according to technical documentation. This benchmark sets a foundation for future advancements in AI-supported engineering design processes. DesignQA is publicly available online.
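The abstract notes that the benchmark uses automatic evaluation metrics to compare model answers against ground truth. As a minimal, purely illustrative sketch (the function name and the exact-match criterion are assumptions for illustration, not the paper's actual metric), scoring rule-retrieval answers by exact match on rule identifiers could look like:

```python
# Hypothetical sketch of an automatic evaluation metric: exact-match
# accuracy over predicted rule identifiers. Illustrative only -- this
# is NOT the metric defined by the DesignQA paper.

def rule_number_accuracy(predictions, ground_truth):
    """Fraction of questions where the predicted rule identifier
    exactly matches the ground-truth identifier (case-insensitive)."""
    assert len(predictions) == len(ground_truth)
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)

# Example with made-up rule identifiers in Formula SAE style:
preds = ["T.6.1.2", "V.1.1", "F.5.5"]
truth = ["T.6.1.2", "V.1.2", "F.5.5"]
print(rule_number_accuracy(preds, truth))  # 2 of 3 match
```

In practice, benchmarks of this kind often complement exact match with softer text-similarity scores so that near-miss answers receive partial credit.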
