Original Article
16(4); 595-605
doi: 10.25259/JNRP_401_2024

Development and content validity analysis of artificial intelligence-generated Indonesian language insomnia questionnaire based on the International Classification of Sleep Disorders, Third Edition

Department of Neurology, Faculty of Medicine, Universitas Airlangga, Universitas Airlangga Hospital, Surabaya, Indonesia.

*Corresponding author: Fidiana, Department of Neurology, Faculty of Medicine, Airlangga University, Surabaya, Indonesia. fidianaa@fk.unair.ac.id

Licence
This is an open-access article distributed under the terms of the Creative Commons Attribution-Non Commercial-Share Alike 4.0 License, which allows others to remix, transform, and build upon the work non-commercially, as long as the author is credited and the new creations are licensed under the identical terms.

How to cite this article: Ikhtiar I, Fidiana, Islamiyah WR. Development and content validity analysis of artificial intelligence-generated Indonesian language insomnia questionnaire based on the International Classification of Sleep Disorders, Third Edition. J Neurosci Rural Pract. 2025;16:595-605. doi: 10.25259/JNRP_401_2024

Abstract

Objectives:

Application of artificial intelligence (AI) in the form of a large language model (LLM) in medicine, especially in the field of neurosomnology, is still limited. The generative ability of LLM can be used to develop a specific screening questionnaire in a specific language. In this study, we aimed to measure the ability of LLM to create a prototype, standardized questionnaire for the detection of insomnia in the Indonesian (ID) language based on the International Classification of Sleep Disorders, Third Edition (ICSD-3).

Materials and Methods:

Systematic prompting of generative pre-trained transformers 4.0-based LLM was done to create a Likert scale-based insomnia questionnaire set in ID. The Independent Expert Panel was established to assess the questionnaire items using Aiken V and content validity index (CVI) methods, as well as medical expert opinion for linguistic usage in the questionnaire.

Results:

The LLM is able to generate a six-item insomnia questionnaire in ID. Content validity analysis with the Aiken method showed V > 0.75 for all items. Items 1 and 2 yielded a perfect score (V = 1), and the alternative form of item 6 yielded the lowest score (V = 0.78). CVI analysis showed a value of 1 for scale-level CVI based on the average and a value of 1 for scale-level CVI based on the universal agreement, which indicates a high validity index. Independent feedback from the Expert Panel shows that linguistic revision is required despite the evidence of content validity.

Conclusion:

AI is able to generate a Likert-scale insomnia questionnaire in ID through a specified prompt based on ICSD-3. The AI-generated ID insomnia questionnaire satisfies content validity criteria based on expert assessment.

Keywords

Artificial intelligence
Content validity
Insomnia questionnaire
Insomnia
Large-language model
Neurosomnology
Sleep medicine

INTRODUCTION

Insomnia is one of the sleep disorders with the highest prevalence and burden. The global prevalence of insomnia varies but is still considered high, ranging from 2.3% to 25.5%.[1] A study showed that the prevalence in Indonesia is higher, at 33.3% for sub-threshold insomnia and 11.0% for clinically significant insomnia.[2] Despite the global magnitude of the problem, insomnia remains under-reported to medical centers, with various local studies reporting that only 39.8% of patients with insomnia symptoms consulted a general practitioner[3] and only 25.6% of patients with chronic insomnia symptoms have active health-seeking behavior for their problem.[4] As a result, there is a gap between insomnia diagnosis and its proper treatment. Hence, a set of easy self-assessment questions that detects insomnia symptoms and alerts its user to seek medical help is required to fill the gap. Several questionnaires have been developed to bridge this diagnostic gap between health professionals and patients. The ever-changing insomnia criteria over the years further challenge the validity of these questionnaires. To date, the two most established insomnia diagnostic criteria are the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) and the International Classification of Sleep Disorders, Third Edition (ICSD-3).[5] The original ICSD-3 is currently adopted by the Indonesian (ID) Neurology Association (Perhimpunan Dokter Neurologi Seluruh Indonesia [PERDOSNI])[6] and has the potential to be developed into a local screening questionnaire. However, an ICSD-3-based insomnia questionnaire in the local language does not currently exist in Indonesia. This creates an additional need: to develop a questionnaire in the local language that also satisfies the currently accepted diagnostic criteria.

Artificial intelligence (AI) is a computational technology that aims to simulate human intelligence in tasks that require it, such as problem solving, decision making, and human-like communication.[7] One AI function, natural language processing (NLP), is specifically designed to understand and analyze human language.[8] A large language model (LLM) is a form of AI trained on a vast amount of linguistic data and capable of applying NLP.[9] The LLM is known to be able to bypass medical language barriers and has been proposed to address translation issues with non-native English (EN)-speaking patients.[10] However, the ability of an LLM to create a valid questionnaire under strict, specified, and standardized criteria is still being explored. Hence, in this study, we explore the ability of AI, in the form of an LLM, to create an ICSD-3-based insomnia screening questionnaire in ID.

MATERIALS AND METHODS

AI

The AI used in this study is an LLM, namely a generative pre-trained transformer (GPT), specifically GPT-4.0 made by OpenAI (OpenAI, California, USA), which is available as a free Microsoft Copilot service (Microsoft, Washington, USA). Microsoft Copilot GPT-4.0 was chosen because it can cite external resources by performing on-demand website searches, generates accurate and contextual responses within seconds, has a newer dataset than GPT-3.5, and is highly accessible, especially as a free service with no additional account login requirement (at the time of this study, only GPT-3.5 was free as an OpenAI service).

Translation, review, adjudication, pretest, and documentation (TRAPD) and TRAPD homolog

The workflow of this study is homologous to the conventional questionnaire translation method of TRAPD, which is followed by psychometric analysis of criterion validity, construct validity, and reliability.[11,12] Traditionally, the initial Translation and Review process is done by a clinical Expert Panel. The results are then referred to the linguistic Expert Panel, who provide a back translation. The back translation is then resubmitted to the clinical Expert Panel for feedback. This entire feedback exchange between clinical and linguistic experts constitutes the Adjudication process. In the Pretest stage, the final questionnaire is pre-tested to obtain data. Documentation is an important stage in which data and process transparency can be audited. The whole conventional TRAPD process is summarized in Figure 1 (blue shade).

Figure 1: Comparison between large language model (LLM)-based questionnaire development and conventional translation, review, adjudication, pretest, and documentation (TRAPD)-based questionnaire translation process. The questionnaire generation process used in this study is homologous to the conventional TRAPD approach. The main difference between TRAPD and this study's methods is that the translation and review process is bypassed entirely by the LLM. Early grammatical revisions in the LLM workflow can be seen as a prompter task with expert panel supervision. In TRAPD, the content validity of a novel questionnaire item, including a translated version, can be assessed as soon as the review process by clinical experts has been done. The criterion validity, construct validity, and reliability instead require at least pretest results from target respondents, which are not covered in this study. Color code: Orange shade: The scope of the current study; Gray shade: Part of TRAPD homolog which is not the scope of the current study; Blue shade: Conventional TRAPD process; Purple: Position of human profession/task relative to each process. The color code is based on Prompter tasks on the TRAPD analogue for LLM-based questionnaire development process: Dark orange: Idea grounding, Light orange: Construct engineering, Light yellow: Questionnaire realization, Pale yellow: Grammatical revision

The TRAPD homolog is an attempt in this study to implement AI in the form of LLM into the conventional TRAPD method. In this study, the GPT-4.0 is tasked to specifically perform only the translation and review stage, bypassing the clinical expert role in the conventional TRAPD method. In this study, we try to minimize the human input on the LLM to achieve a questionnaire product that is ideally independent from direct human influence. The minimization is achieved by segregating the human and LLM roles and by segregating expert influence during the questionnaire creation process. This minimization resulted in a study workflow homologous to the TRAPD [Figure 1, orange shade].

The role of the LLM is limited to the initial Translation and Review stages, in line with the aim of this study. Because the language expert's role in the original TRAPD starts at the Adjudication stage, and because content validity must first be satisfied in the previous stages, the TRAPD homolog in this study did not involve linguistic evaluation, e.g., back translation and linguistic validity analysis [Figure 1, gray shade]. Since the content validity of the LLM-generated items was unknown, the adjudication, pre-test, and subsequent psychometric analysis of criterion validity, construct validity, and reliability were not performed; they are deferred until this study produces enough evidence to initiate those stages.

Human panel

The human panel consists of a Prompter and an Expert Panel. Both the Prompter and the Expert Panel are members of PERDOSNI. Each appointed Expert Panel member is a certified neurosomnologist, a lecturer in a faculty of medicine and/or a supervisor in a teaching hospital, and must be acknowledged by PERDOSNI as a member of the PERDOSNI sleep disorder study group. All appointed Expert Panel members have served in neurology in Indonesia for more than 10 years: Expert Panel Member 1 for 15 years, Expert Panel Member 2 for 13 years, and Expert Panel Member 3 for 14 years.

The Prompter is tasked with designing and creating the prompts used in questionnaire generation. The prompts were designed based on the insomnia diagnostic criteria in ICSD-3 and the sleep questionnaire quality standard suggested by Klingman et al.[13] The Prompter is allowed to ask the LLM to revise the generated questionnaire but is not allowed to dictate the specific revision (for example: “Revise the answers so it can be more understandable by common folks” is allowed, but “revise the answer A sentence into X” is not). Hence, the LLM must use its full potential without human assistance during the questionnaire generation process.

For the Expert Panel, three ID neurosomnologists were appointed to assess the generated questionnaire using Aiken V and the content validity index (CVI) method. Before the assessment, Expert Panel members were informed about the ICSD-3 diagnostic criteria as a quality standard. To ensure independent production of the insomnia questionnaire by the LLM, Expert Panel members have neither access nor intervention rights during the questionnaire generation process. They have the right to revise and intervene only after the questionnaire item generation process. To ensure that only the pure performance of the LLM is validated, only questionnaire items that have not undergone expert revision are included in the content validity analysis.

Questionnaire generation process

The questionnaire generation process is designed to be independent from Expert Panel influence. This independence is meant to measure the LLM's baseline performance in generating products of unknown concepts. To ensure the LLM's readiness for questionnaire generation, the Prompter led the LLM through four phases. The first phase is idea grounding. The goal of the idea grounding phase is to ensure that the LLM understands that insomnia is the “idea”, the main topic being discussed. The idea of making a questionnaire is not mentioned in this phase, to avoid topic understanding failure as well as to meet the questionnaire quality standard. The second phase is construct engineering. The goal of construct engineering is to further narrow the LLM's answers to the insomnia construct based on ICSD-3. The construct engineering phase also assesses which diagnostic criteria of insomnia the LLM currently uses. The third phase is questionnaire realization. In this phase, the goal is to create a set of insomnia questionnaire items based on the prompted topic and construct. The fourth phase is grammatical revision. The goal of this phase is to fix grammatical and stylistic inconsistencies in the questionnaire. Technical terms were also revised in this phase. All prompts used in all phases were in ID, and all LLM output produced in all phases was also in ID. The entire workflow, encompassing the questionnaire generation process and its comparison with the TRAPD process, is summarized in Figure 1.

Content validity analysis

Content validity was measured using Aiken V and CVI with the modified Kappa method.[14-16] The purpose of the Aiken method is to calculate the V value. The V value measures the validity coefficient for each questionnaire item (Vi) and for each rater (Vr).

For Vi, the V value is calculated by the formula:

Vi = ∑Si/(n[C-1])

While V value in Vr is calculated by the formula:

Vr = ∑Sr/(m[C-1])

Each rater gives each questionnaire item a rate (R) on a Likert scale from 1 (strongly disagree) to 4 (totally agree). S is calculated as R minus the lowest possible rate (Lo), and ∑S is the total of S across raters for one item (∑Si) or across items for one rater (∑Sr). The n is the number of raters, which is equal to the number of Expert Panel members. The rate count (C) is the number of rating categories available for each item. The generated questionnaire was expected to have at least six questions based on the ICSD-3 insomnia diagnostic criteria. Hence, the m value (number of items) used with the Aiken V table is 6.[15] Based on that information and the Aiken V table, a questionnaire item meets content validity criteria when 0.75 ≤ V ≤ 1 (P < 0.05).
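The two Aiken formulas above can be illustrated with a minimal Python sketch. The function names are hypothetical and are not taken from the study; the rating matrix is copied from Table 3 (rows are items 1 to 6, columns are Expert Panel Members 1 to 3), with Lo = 1 and C = 4 as stated in the text.

```python
# Ratings as reported in Table 3: one row per item, one column per rater.
RATINGS = [
    [4, 4, 4],
    [4, 4, 4],
    [3, 4, 4],
    [4, 3, 4],
    [4, 3, 4],
    [3, 3, 4],
]
LO = 1  # lowest possible rate on the Likert scale
C = 4   # number of rating categories


def aiken_v_items(ratings, lo=LO, c=C):
    """Vi = sum(Si) / (n * (C - 1)), one value per item (n = number of raters)."""
    values = []
    for row in ratings:
        n = len(row)
        s_total = sum(r - lo for r in row)  # S = R - Lo, summed over raters
        values.append(s_total / (n * (c - 1)))
    return values


def aiken_v_raters(ratings, lo=LO, c=C):
    """Vr = sum(Sr) / (m * (C - 1)), one value per rater (m = number of items)."""
    m = len(ratings)
    n = len(ratings[0])
    values = []
    for j in range(n):
        s_total = sum(row[j] - lo for row in ratings)  # summed over items
        values.append(s_total / (m * (c - 1)))
    return values


vi = [round(v, 2) for v in aiken_v_items(RATINGS)]
vr = [round(v, 2) for v in aiken_v_raters(RATINGS)]
print(vi)
print(vr)
```

Under these assumptions, the rounded values agree with the Vi and Vr columns reported in Table 3.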

The CVI with modified kappa statistics is a modified CVI with an adjustment to fix the chance agreement issue.[16] It consists of item-level CVI (I-CVI), scale-level CVI based on the average (S-CVI/Ave), and scale-level CVI based on the universal agreement (S-CVI/UA).

The I-CVI measures the proportion of experts giving an item a high relevance rating. I-CVI describes validity per item rather than per questionnaire set. It is calculated by the formula:

I-CVI = na/n

The number of expert-in-agreement (na) is the number of experts who score at a high rate (score 3–4). The I-CVI value for 3 experts to satisfy content validity should be 1.

The S-CVI/Ave measures the average of I-CVI scores for all items. The S-CVI/Ave describes the validity of a questionnaire as a set rather than per item. The S-CVI/Ave is calculated by the formula:

S-CVI/Ave = ∑I-CVI/m

Where ∑I-CVI is the sum of I-CVI and m is the number of all items measured. The S-CVI/Ave value for 3 experts to satisfy content validity should be 1.

The S-CVI/UA measures the proportion of items that receive a high rate (score 3–4) from all experts. The S-CVI/UA describes the validity of a questionnaire based on universal score agreement by the Expert Panel. The S-CVI/UA is calculated by the formula:

S-CVI/UA = UA/m

Where UA is the total number of items reaching universal agreement and m is the total number of all items measured. The S-CVI/UA value for 3 experts to satisfy content validity should be 1.
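As a rough illustration of the three CVI formulas, the following Python sketch computes I-CVI, S-CVI/Ave, and S-CVI/UA from a matrix of expert ratings. The function names and the `threshold` parameter are assumptions made for this example; the rating matrix is the one reported in Table 3, and a rating of 3 or 4 counts as agreement, as defined above.

```python
# Ratings as reported in Table 3: one row per item, one column per expert.
RATINGS = [
    [4, 4, 4],
    [4, 4, 4],
    [3, 4, 4],
    [4, 3, 4],
    [4, 3, 4],
    [3, 3, 4],
]


def i_cvi(item_ratings, threshold=3):
    """I-CVI = na / n: share of experts rating the item at or above the threshold."""
    na = sum(1 for r in item_ratings if r >= threshold)
    return na / len(item_ratings)


def s_cvi_ave(ratings):
    """S-CVI/Ave = sum of I-CVI over all items, divided by the number of items m."""
    scores = [i_cvi(row) for row in ratings]
    return sum(scores) / len(scores)


def s_cvi_ua(ratings):
    """S-CVI/UA = UA / m: share of items on which every expert agrees."""
    ua = sum(1 for row in ratings if i_cvi(row) == 1.0)
    return ua / len(ratings)


print([i_cvi(row) for row in RATINGS])
print(s_cvi_ave(RATINGS), s_cvi_ua(RATINGS))
```

With three experts, a single rating below 3 lowers an item's I-CVI to 2/3, which is why the text requires a value of 1 for every index.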

Post-generation quality analysis

Post-generation quality analysis was done after content validity analysis to measure the LLM's ability to create a proper questionnaire without any intervention from human experts. Post-generation quality analysis was done by the Expert Panel. The Expert Panel was given full freedom to revise the questionnaire, from grammar to questionnaire structure. The Prompter made no contribution in this phase.

RESULTS

Questionnaire generation

The GPT-4.0 LLM went through four phases of prompt input to create the questionnaire. The idea grounding phase was passed successfully without revision. The prompt and response log indicates the LLM's ability to refer to validated sources (e.g., the American Academy of Sleep Medicine website for insomnia) as well as popular writing sources. The construct engineering phase indicated a major lack of ICSD-3 diagnostic criteria in the LLM training dataset. Initially, the LLM failed to state the ICSD-3 diagnostic criteria for insomnia correctly even after four revisions under the same or similar prompts (repeated prompt used: ID: Jawaban [masih] salah, mohon perbaiki berdasarkan kriteria diagnostik insomnia sesuai ICSD-3; EN: [Still] Wrong answer, please correct it based on ICSD-3 diagnostic criteria of insomnia). Manual input of the complete ICSD-3 insomnia criteria text was required before the LLM could state the correct ICSD-3 construct. The corrected answer, however, showed a tendency to summarize criteria A and B for both short-term and chronic insomnia, resulting in loss of criteria detail. One further revision was required to ensure that the LLM was able to repeat the ICSD-3 criteria completely (prompt: ID: Coba jelaskan secara mendetail seperti yang telah saya tulis sebelumnya, jangan meringkas detail tersebut. Pertahankan penulisan kriteria A-E untuk insomnia jangka pendek dan kriteria A-F untuk insomnia kronis seperti ICSD-3 yang saya tulis; EN: Try to explain as in detailed manner as I have written before, (and) do not summarize the details. Keep the writing of criteria A-E for short-term insomnia and criteria A-F for chronic insomnia as the ICSD-3 I had written). The idea grounding and construct engineering processes are summarized in Table 1.

Table 1: Summary of questionnaire generation process.
No. Phase Initial prompt Revisions needed Outcome
1. Idea grounding ID: Mari kita membahas mengenai insomnia secara ilmiah sesuai ICSD-3
EN: Let’s talk about insomnia scientifically, based on ICSD-3
0 Correct definition of insomnia based on ICSD-3
2. Construct engineering ID: Coba jelaskan kriteria diagnostik sesuai ICSD-3 untuk insomnia jangka pendek dan kronis
EN: Please explain to me about the diagnostic criteria according to ICSD-3 for short-term and chronic insomnia
4 Failure for LLM to mention the chronic and short-term insomnia diagnostic criteria based exactly on ICSD-3
ID: Akan saya koreksi jawaban tersebut dengan dataset sesuai dengan definisi insomnia menurut ICSD-3 yang sebenarnya, yaitu seperti yang tertulis di bawah ini (diikuti dengan daftar kriteria diagnostik ICSD-3 ID)
EN: I will correct those answers with correct dataset according to ICSD-3, as written below (followed by ICSD-3 list of diagnostic criteria in ID)
1 Ability for LLM to mention chronic and short-term diagnostic criteria based on ICSD-3
3. Questionnaire Realization ID: Sesuai dengan kriteria A-E ICSD-3 untuk insomnia jangka pendek, dan kriteria A-F ICSD-3 untuk insomnia kronis seperti yang telah dijelaskan tadi, buatkan suatu kuesioner skrining insomnia berdasarkan skala Likert yang dapat digunakan untuk membantu diagnosis ada dan tidaknya insomnia, dan membantu pengisi kuesioner untuk untuk menilai gejala insomnia. Skala Likert tersebut upayakan dapat menjelaskan angka atau jumlah hari, bukan hanya frekuensi kualitatif seperti “sering, jarang, kadang-kadang, dsb.” Buatlah sedemikian rupa agar jumlah pertanyaannya optimal, dapat mewakili seluruh kriteria diagnostik insomnia ICSD-3 yang telah dijelaskan.
EN: In accordance with ICSD-3 criteria A-E for short-term insomnia, and ICSD-3 criteria A-F for chronic insomnia as described earlier, develop an insomnia screening questionnaire based on a Likert scale that can be used to help diagnose the presence or absence of insomnia, and help the questionnaire taker to assess insomnia symptoms. The Likert scale should be able to describe a number of days, not just a qualitative frequency such as “often, rarely, sometimes, etc.” Make it so that the number of questions is optimal, representing all the ICSD-3 insomnia diagnostic criteria described.
7
  • Ability for LLM to create a Likert-scale questionnaire based on the ICSD-3 construct

  • Ability for LLM to revise questionnaire item answers to match a minimum of three answers in Likert scale style

4. Grammatical Revisions Various prompts were adjusted for grammatical purposes, especially word choices and simplicity 4 LLM-generated insomnia questionnaire in ID with grammatical and non-construct revisions ala LLM.

EN: English version of prompt based on Prompter translation, ICSD-3: International Classification of Sleep Disorders, Third Edition, ID: Indonesian (original input), LLM: Large Language Model. Content rows are color-coded based on the LLM workflow in Figure 1. The color code is based on Prompter tasks on the TRAPD analogue for LLM-based questionnaire development process: Dark orange: Idea grounding, Light orange: Construct engineering, Light yellow: Questionnaire realization, Pale yellow: Grammatical revision

The questionnaire realization phase starts with a long, detailed prompt presented in Table 1. The LLM is able to create a Likert-scale ICSD-3-based questionnaire by citing the Insomnia Severity Index (ISI) design. The questionnaire realization phase required revisions to create a questionnaire with a consistent answer style. During this phase, the LLM also tended to summarize and over-generalize the questions, which required revision to ensure that the LLM kept the main idea of the ICSD-3 criteria. The whole questionnaire generation process is summarized in Table 1. A comparison between the complete LLM-generated questionnaire items, the ISI items, and their relation to the ICSD-3 constructs, as well as their correlation with the workflow color code in Figure 1, is presented in Table 2.

Table 2: Comparison between the ICSD-3 construct and its respective LLM-generated items and ISI items.
ICSD-3 constructs LLM-generated items ISI Items
Difficulty on initiating, maintaining, early waking, and sleep resistance (Criteria A in Short Term and Chronic insomnia Disorder) ID:
• Seberapa sering Anda mengalami kesulitan memulai tidur di malam hari?
• Seberapa sering Anda mengalami kesulitan mempertahankan tidur di malam hari (misalnya, sering terbangun)?
EN:
• How often do you experience difficulty initiating sleep at night
• How often do you experience difficulty maintaining night sleep (for example: Frequent waking up)
EN:
Please rate the current (i.e., last 2 weeks) severity of your insomnia problem(s):
• Difficulty falling asleep
• Difficulty staying asleep
• Problems waking up too early
• How satisfied/dissatisfied are you with your current sleep pattern?
Daytime functional impairments (Criteria B in Short-Term and Chronic Insomnia Disorder) ID:
• Seberapa sering gangguan tidur mengganggu aktivitas sehari-hari Anda (misalnya, pekerjaan, hubungan sosial, atau kualitas hidup)?
EN:
• How often do your sleep problems impair your daily activities (e.g., work, social interaction, or quality of life)?
EN:
• How noticeable to others do you think your sleep problem is in terms of impairing the quality of your life?
• How worried/distressed are you about your current sleep problem?
• To what extent do you consider your sleep problem to interfere with your daily functioning (e.g., daytime fatigue, mood, ability to function at work/daily chores, concentration, memory, mood, etc.) currently?
Adequacy of sleep opportunity and environmental factors (Criteria C in Short-Term and Chronic Insomnia Disorder) ID:
• Apakah kesulitan tidur Anda dapat dijelaskan oleh faktor eksternal seperti kurangnya kesempatan atau lingkungan tidur yang tidak memadai?
EN:
• Can your sleep difficulty be explained by external factors, such as inadequate chance or environment for sleep?
N/A
Symptoms duration and/or frequency (Criteria D in Short Term and Criteria D and E in Chronic insomnia Disorder) ID:
• Seberapa sering Anda mengalami kesulitan memulai tidur di malam hari?
• Berapa lama Anda telah mengalami kesulitan tidur ini?
EN:
• How often do you experience night sleep difficulty? (repeat of first item)
• How long have you been experiencing this sleep difficulty?
N/A
Exclusion criteria and/or differential diagnosis (Criteria E in Short Term and Criteria F in Chronic Insomnia Disorder) ID:
• Apakah Anda memiliki gangguan tidur lain yang dapat menjelaskan gejala ini (misalnya, kaki gelisah, gangguan pernapasan saat tidur, atau gangguan tidur lainnya)?
EN:
• Do you have other sleep disorder that can explain your current complaints (e.g., restless leg syndrome, breathing difficulty, or other sleep problems?)
N/A

EN: English version based on Prompter translation, ICSD-3: International Classification of Sleep Disorders, Third Edition, ID: Indonesian (original result), ISI: Insomnia Severity Index, LLM: Large Language Model, N/A: Not applicable

Color code: blue shade: Insomnia diagnostic criteria based on ICSD-3; yellow shade: List of LLM-generated items based on ICSD-3 criteria on the same row; orange shade: List of ISI items and its comparison with ICSD-3 criteria in the same row

Content validity analysis

Content validity analysis by the Expert Panel using the Aiken V method showed that the LLM-generated insomnia questionnaire fulfilled the content validity criteria. In the per-item analysis, questionnaire items No. 1 and No. 2 showed the highest value, Vi = 1 (P < 0.05). Item No. 6 showed the lowest value, Vi = 0.778 (P < 0.05). Per-rater analysis showed that all raters indicated content validity, with V values in the range 0.83 ≤ Vr ≤ 1. The detailed rates given per item and their respective V values are presented in Table 3.

Table 3: Aiken V calculation result.
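m=6, n=3, S=R – Lo
Item no. | R1 | R2 | R3 | S1 | S2 | S3 | ∑Si | Vi
1. | 4 | 4 | 4 | 3 | 3 | 3 | 9 | 1
2. | 4 | 4 | 4 | 3 | 3 | 3 | 9 | 1
3. | 3 | 4 | 4 | 2 | 3 | 3 | 8 | 0.89
4. | 4 | 3 | 4 | 3 | 2 | 3 | 8 | 0.89
5. | 4 | 3 | 4 | 3 | 2 | 3 | 8 | 0.89
6. | 3 | 3 | 4 | 2 | 2 | 3 | 7 | 0.78
∑Sr | | | | 16 | 15 | 18 | |
Vr | | | | 0.89 | 0.83 | 1 | |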

R1: Rate given by Expert Panel Member 1, R2: Rate given by Expert Panel Member 2, R3: Rate given by Expert Panel Member 3, S=R – Lo: The rate minus the lowest possible score, S1=S obtained from Expert Panel Member 1, S2=S obtained from Expert Panel Member 2, S3=S obtained from Expert Panel Member 3, ∑Si: Total S across raters for each item, ∑Sr: Total S across items for each rater, Vi: Aiken V value for each item, Vr: Aiken V value for each rater, m: Number of items, n: Number of raters

Analysis using the CVI method also indicates that the LLM-generated insomnia questionnaire fulfilled the content validity criteria. The agreement rate is na = 3 for each item and UA = 6. The I-CVI value of every item is 1, with S-CVI/Ave = 1, whereas UA for indicating universal agreement by the Expert Panel is 1 for each item, with S-CVI/UA = 1.

Post-generation quality analysis

Post-generation quality analysis was done after content validity analysis. The Expert Panel made qualitative revisions to items No. 4 and No. 6. For item No. 4, the experts suggested directly dividing the symptom duration between chronic and short-term insomnia, resulting in three answer categories (never, less than 3 months, and more than 3 months) instead of a time interval. The Expert Panel suggested heavier qualitative revisions for item No. 6: all literal and technical disease names had to be changed into symptoms that lay readers can more easily recognize, for example: Kesulitan bernapas (EN: Breathing difficulty) into tersedak saat tidur (EN: Choking while sleeping) or mendengkur (EN: Snoring), along with minor grammatical changes to the answer lists. No revision required the deletion or addition of a question.

DISCUSSION

This study demonstrates the ability of an LLM to create an insomnia questionnaire from a prompt engineered for a specific topic. The performance of LLMs in creating a specified questionnaire has been explored in several studies and has produced satisfactory results. One study compared medical-examination questionnaires written by a GPT-based LLM and by humans and found no statistically significant difference in item difficulty between the two.[17] Another study compared several LLMs in the making of a medical-examination questionnaire and found significant differences in validity and difficulty between models,[18] suggesting that differences in performance are caused by different training datasets and LLM architectures. The completeness of training data has been a major issue for LLM performance, especially in the field of medicine. This prompts the need for manual, human-involved benchmarking of LLMs regarding potential performance issues, such as uncertainty and data presentation accuracy.[19]

The LLM in this study is able to cite ISI and generate ICSD-3-based questionnaire items, although it cannot initially identify ICSD-3 diagnostic criteria for insomnia. It is important to note that although the LLM in this study is able to cite ISI, the coverage of the ICSD-3 construct differs noticeably between the LLM-generated items and the ISI items [Table 2]. The ISI is mainly based on DSM-Fourth Edition (DSM-IV), one of several classification systems that describe insomnia.[20] The ever-changing insomnia diagnostic criteria provide a basis for various classification systems and screening instrument development. For example, the Athens Insomnia Scale (AIS) is based on the International Classification of Diseases, Tenth Edition,[21] ISI is based on DSM-IV,[20] Sleep Condition Indicator (SCI) is based on DSM-5,[22] and Insomnia Screening Scale (ISS) is based on ICSD-2.[23] The DSM is the earliest to describe insomnia diagnostic criteria by comparing it to other mental disorders. The DSM was revised in 2022 as DSM-5-Text Revision (DSM-5-TR), and it contains eight diagnostic criteria for insomnia. The ICSD-3 was released in 2014, 8 years earlier than DSM-5-TR.[24] Compared to DSM-5, the original ICSD-3 does not segregate insomnia from sleep difficulty caused by mental illness, drug and substance abuse, or organic comorbidities. The segregation was only applied in the text revision (ICSD-3-TR) in 2023, under a single criterion.[25] This recent segregation may help explain the expert panel’s relatively lower Aiken V score for the sixth item, which was generated based on the pre-revision ICSD-3. Based on these evolving criteria, the LLM in this study has been shown to be qualitatively able to produce an instrument containing the questions needed to collect the data required by the most recent diagnostic criteria, the ICSD-3-TR.
The LLM-generated questionnaire also has fewer questions (7 items) than AIS (8 items),[21] SCI (8 items),[22] and ISS (26 items),[23] while being on par with ISI (7 items).[20] Limitations were also observed relative to all of these questionnaires, such as uneven generation of the Likert scale between items, as well as the decision of the LLM to explain how the total score should be calculated without explaining how the score should be categorized, leaving the questionnaire open to further validity analysis.

This study demonstrates the ability of an LLM to create a questionnaire directly in the ID language. Since the LLM is mainly trained in EN, a combination of the conventional questionnaire development method and the conventional instrument translation method should be used. Because this study does not involve a human expert in the questionnaire generation process, a process dedicated to content validation is mandatory. Hence, we deployed methods similar to those used in the development of the Recovering Quality of Life (ReQoL) measure. The methods revolve around four processes: (1) generation of candidate questionnaire items, (2) content validity assessment, (3) psychometric evaluation, and (4) selection of final questionnaire items.[26] Item generation traditionally requires experts across multidisciplinary fields to ensure content validity, as demonstrated in the creation of the Adolescent Insomnia Questionnaire.[27] However, in this study, the questionnaire item generation phase was handled solely by AI, with unknown dataset completeness and without expert intervention. Studies have shown that item generation and data extraction under non-standardized prompts and unknown LLM dataset transparency lead to accuracy, consistency, and reliability issues.[28-30] Combined with our findings, we suggest the need for prior LLM dataset confirmation during item generation, which in this study was realized as the deployment of specific questionnaire generation phases.

The use of auto-translation-capable LLM in this study provides insight into how we should address the foreign instrument translation issue in the era of AI. Questionnaire translation is usually done conventionally by performing six different stages encompassing five essential processes of TRAPD,[11,12] in which only the first two stages are covered in this study. The first stage, “Translation”, traditionally requires language translation by native and bilingual speakers, which is bypassed in this study by AI. The second stage, “Review”, requires an expert round to assess the content validity and to modify questionnaire wording, which is proven to be unskippable even in this study as evidenced by multiple revisions during questionnaire generation.

Aside from content validity, linguistic validity is also important for measuring comprehensibility when the questionnaire is presented to respondents. Because of sociocultural variations in society, linguistic validity affects how well respondents understand the screening questionnaire when filling it out. Studies have shown that every questionnaire in development requires its own validation by a linguist.[31,32] Linguistic validity can be measured using a simple Likert scale, involving bilingual experts[33] as well as real target respondents.[34] Considering the unknown parameters of linguistic validity of AI and AI products, the role of linguists is still needed in parallel to the content validation process.

Content validity of an instrument can be examined by several methods, such as Aiken V and CVI, all of which require the involvement of a human expert for critical assessment.[15,16] Content validity of non-ICSD-3-based insomnia questionnaires has been reported previously. The most prominent insomnia screening instrument for professionals, the Pittsburgh Sleep Quality Index, has an S-CVI/Ave of 0.905, which indicates good overall content validity.[35] The Holland Sleep Disorders Questionnaire has item CVI values between 0.83 and 1, suggesting high validity.[36]

Content validity is often not deeply explored because the conventional instrument creation process already directly involves experts in their field. This lack of exploration has been recorded in several studies. For example, a systematic review on the structural validity of ISI showed that the validity component is mostly explored as internal consistency, test-retest reliability, construct validity, and diagnostic validity, but not content validity.[37] However, in the age of AI capable of creating similar instruments with unknown source transparency, analysis of content validity has become more relevant than ever.

The use of an LLM for creating a novel clinical questionnaire is not without ethical concern. As demonstrated in this study, the LLM claims to create novel questionnaire items based on ISI. The actual ISI, however, only covers Criteria A and Criteria B of chronic and short-term insomnia based on ICSD-3. This is because ISI itself is based on entirely different insomnia diagnostic criteria, the DSM-IV.[37,38] There are clearly questionnaire choices with more updated criteria, or even a directly ICSD-based choice such as SCI or ISS; however, the reason GPT-4.0 chose ISI as its first answer remains in question. This finding indicates a major bias in LLM output. It is worth noting that the LLM itself cannot be free from bias, since bias is the natural state of any machine incorporating a neural network. This “natural bias” can be explained by the iteration of a random number sequence of a specific length known as the “seed”, which is the staple method used by machine learning workflows.[39-41] Each number value and position in the seed provides “weight” relative to the input given, and this further dictates the following answer spectrum.[42] This answer spectrum is then correlated in the neural network hidden layer to generate the final output, resulting in dynamic but natural-looking responses. This initial randomization, however, becomes a major problem when using machine learning as a scientific tool because it may reduce,[43] inflate,[39,40] or even eliminate scientific reproducibility, unless the exact initial seed is known.[44] The neural network hidden layer itself is another ethically controversial subject.
A hidden layer is a specified and correlated linear network of a trained dataset to recognize given input.[45] Concerns have been raised by multiple societies and disciplines given the nature of these trained datasets, especially in LLMs without clear training data disclosure.[46-49] These include but are not limited to how the data are obtained, what happens to licensed data, how the data can be recognized in the hidden layer, how the data are trained, and how the data are validated before the training.[47,50-52] Due to the nature of machine learning, these trained data are also able to update themselves based on further response to the output, creating substantially different and more desirable output, hence the learning part of machine learning.[42] This learning capability, combined with a hidden seed, may further complicate the former bias problem, since under the learning influence, the output is updated too, technically eliminating exact output reproducibility.[40] Further complicating this problem is the nature of the dataset used in LLM training. A review in 2024 showed that an incomplete dataset may affect LLM output due to bias caused by imbalanced sample size, non-random missing data, uncaptured/not-easily-available data, misclassification during dataset labeling and input, and imbalanced origin of data publication (racial and ethnicity bias).[53] While these ethical issues can be avoided by creating a custom-made, task-specific LLM using open-source options, the rarity of expertise and resources currently makes it difficult to avoid the use of more accessible but unspecialized LLMs with questionable data disclosure. In the dataset completeness context, several technical strategies have been proposed to mitigate this bias problem.
The strategies include but are not limited to minority data oversampling, data augmentation, usage of a standardized reporting checklist, avoiding deliberate acts of filling missing data (non-imputation), compliance with unified expert consensus, and usage of standardized data criteria for predictive purposes.[53] Based on our findings and the existing evidence, human intervention in the form of an expert validator is still required to mitigate biases that may happen in LLM-generated products such as a diagnostic-screening questionnaire.
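
The dependence of reproducibility on a known seed, as discussed above, can be illustrated with a toy example. The sketch below uses only Python's standard pseudo-random number generator and a hypothetical helper name; it is not a model of how GPT-4.0 or any other LLM samples its output, and only shows that the same seed yields the same pseudo-random sequence while a different seed does not.

```python
import random


def sample_sequence(seed, n=5):
    """Draw n pseudo-random numbers from a generator initialised with `seed`."""
    rng = random.Random(seed)  # independent generator, so global state is untouched
    return [rng.random() for _ in range(n)]


# Two runs with the same (known) seed reproduce each other exactly.
run_a = sample_sequence(seed=42)
run_b = sample_sequence(seed=42)

# A run with a different seed diverges from the first two.
run_c = sample_sequence(seed=43)

print("Same seed, same sequence:", run_a == run_b)
print("Different seed, same sequence:", run_a == run_c)
```

When the seed is hidden from the user, as with the LLM used in this study, a later run cannot be matched to an earlier one in this way.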

There are several weaknesses limiting this study. First, the analysis is limited to content validity only. A good questionnaire should satisfy not only content validity but also construct validity, criterion validity, and reliability, which were not evaluated in this study. Further investigation involving a population of respondents and addressing initial psychometric data covering construct validity, criterion validity, and reliability is required to fully implement this questionnaire. Second, although this study covers local expert opinions on the auto-translated revision, it did not include a linguistic evaluation to address linguistic validity. Based on the Adjudication phase in the TRAPD homolog, we deemed linguistic evaluation of the LLM product to be an important yet unexplored aspect that still requires investigation in future research. We also suggest future studies comparing the results of blinded Adjudication and blinded pre-test stages with their non-blinded counterparts, to investigate the effects of minimum human intervention during the LLM workflow. Third, this study relies on a commonly available, unspecialized LLM with an unretrievable random seed value and unknown dataset completeness, which hinders exact reproducibility of the generated items under the same prompt. Our study results suggest that proprietary LLMs may provide very powerful computational speed and access, but their limited dataset has been proven to severely limit the response results, no matter how detailed the given prompt is. This demonstrates that a specific LLM with a more transparent dataset and open-sourced model architecture is much more compatible, valuable, and important in the field of medicine.

Availability of data and material

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. Input log data used and analyzed for the current study are available as supplementary material. The data are available as is and only in the Indonesian language.

CONCLUSION

This study shows that AI is able to generate an insomnia questionnaire that satisfies content validity requirements. This study also shows the general ability of AI to bypass language barriers, and demonstrates that human intervention is still required to address various specific issues involving topic stability and language diction. The final product of this study is meant to be a set of easy, self-assessment questions to detect insomnia symptoms and to alert its user when to seek medical help. While some of the starting requirements for a good questionnaire have been evaluated in this study, further studies, such as in the scope of linguistic validity and questionnaire psychometry, are still required to examine whether LLMs are ready to truly assist the medical diagnostic process.

Acknowledgment:

We are very thankful to Zamroni Afif, M.D, Ph.D., from Brawijaya University, East Java, Indonesia, who has provided blinded neurosomnology expertise support during the data acquisition process of this research.

Authors’ contributions:

All authors whose names appear on the submission made substantial contributions to the making of this manuscript. I.I., F., and W.R.I. performed the data acquisition. I.I. performed data analysis, drafted the manuscript, and revised it critically for important intellectual content. W.R.I. provided expert consultation in the respective field. Finally, F. and W.R.I. reviewed and approved the version to be published. All authors agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Ethical approval:

The study was approved by the Ethics Committee of Dr. Soetomo General Academic Hospital, Surabaya, with ethical clearance No. 1088/KEPK/VIII/2024, dated August 22, 2024.

Declaration of patient consent:

The authors certify that they have obtained all appropriate patient consent.

Conflicts of interest:

There are no conflicts of interest.

Use of artificial intelligence (AI)-assisted technology for manuscript preparation:

The authors confirm that there was no use of artificial intelligence (AI)-assisted technology for assisting in the writing or editing of the manuscript and no images were manipulated using AI.

Financial support and sponsorship: Nil.

References

  1. International study of the prevalence and factors associated with insomnia in the general population. Sleep Med. 2021;82:186-92.
  2. Prevalence, social and health correlates of insomnia among persons 15 years and older in Indonesia. Psychol Health Med. 2019;24:757-68.
  3. Primary care is the frontline for help-seeking insomnia patients. Eur J Gen Pract. 2021;27:286-93.
  4. Help-seeking behavior of young and middle-aged Austrians with chronic insomnia: Results from the 2017 national sleep survey. Sleep Epidemiol. 2021;1:100002.
  5. European guideline for the diagnosis and treatment of insomnia. J Sleep Res. 2017;26:675-700.
  6. Pedoman praktik klinis neurologi 2023. Perhimpunan Dokter Neurologi Seluruh Indonesia. 2023.
  7. The future landscape of large language models in medicine. Commun Med. 2023;3:141.
  8. Artificial intelligence in clinical medicine: Catalyzing a sustainable global healthcare paradigm. Front Artif Intell. 2023;6:1227091.
  9. The role of large language models in transforming emergency medicine: Scoping review. JMIR Med Inform. 2024;12:e53787.
  10. Can ChatGPT rescue or assist with language barriers in healthcare communication? Patient Educ Couns. 2023;115:107940.
  11. Assessing sleep problems and daytime functioning: A translation, adaption, and validation of the Athens Insomnia Scale for non-clinical application (AIS-NCA). Psychol Health. 2023;38:1006-31.
  12. The TRAPD approach as a method for questionnaire translation. Front Psychiatry. 2023;14:1199989.
  13. Questionnaires that screen for multiple sleep disorders. Sleep Med Rev. 2017;32:37-44.
  14. Evaluation of methods used for estimating content validity. Res Social Adm Pharm. 2019;15:214-21.
  15. Three coefficients for analyzing the reliability and validity of ratings. Educ Psychol Meas. 1985;45:131-42.
  16. Is the CVI an acceptable indicator of content validity? Appraisal and recommendations. Res Nurs Health. 2007;30:459-67.
  17. Large language models in medical education: Comparing ChatGPT- to human-generated exam questions. Acad Med. 2024;99:508-12.
  18. Analysing the applicability of ChatGPT, Bard, and Bing to generate reasoning-based multiple-choice questions in medical physiology. Cureus. 2023;15:e40977.
  19. Large language models encode clinical knowledge. Nature. 2023;620:172-80.
  20. The insomnia severity index: Psychometric indicators to detect insomnia cases and evaluate treatment response. Sleep. 2011;34:601-8.
  21. Athens insomnia scale: Validation of an instrument based on ICD-10 criteria. J Psychosom Res. 2000;48:555-60.
  22. Validation of a French version of the sleep condition indicator: A clinical screening tool for insomnia disorder according to DSM-5 criteria. J Sleep Res. 2017;26:702-8.
  23. Development of the insomnia screening scale based on ICSD-II. Int J Psychiatry Clin Pract. 2012;16:259-67.
  24. International classification of sleep disorders (3rd ed). United States: American Academy of Sleep Medicine; .
  25. International classification of sleep disorders In: Text revision (ICSD-3-TR) (3rd ed). United States: American Academy of Sleep Medicine; .
  26. The importance of content and face validity in instrument development: Lessons learnt from service users when developing the Recovering Quality of Life measure (ReQoL). Qual Life Res. 2018;27:1893-902.
  27. Development and validation of the adolescent insomnia questionnaire. J Pediatr Psychol. 2020;45:61-71.
  28. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digit Med. 2024;7:41.
  29. Performance of two large language models for data extraction in evidence synthesis. Res Synth Methods. 2024;15:818-24.
  30. Evaluation of large language model performance and reliability for citations and references in scholarly writing: Cross-disciplinary study. J Med Internet Res. 2024;26:e52935.
  31. Language proficiency among respondents: Implications for data quality in a longitudinal face-to-face survey. J Surv Stat Methodol. 2021;9:73-93.
  32. Developing and validating a questionnaire on young learners' multilingualism and multilingual identity. Lang Learn J. 2021;49:404-19.
  33. Validation of the Korean version of the mini-sleep questionnaire-insomnia in Korean college students. Asian Nurs Res (Korean Soc Nurs Sci). 2017;11:1-5.
  34. Linguistic validation of a questionnaire for the screening of OSAS in a paediatric population with Down Syndrome. Eur J Paediatr Dent. 2023;23:128-30.
  35. Reliability and validity of the Pittsburgh Sleep Quality Index among frontline COVID-19 health care workers using classical test theory and item response theory. J Clin Sleep Med. 2022;18:541-51.
  36. Assessment of the psychometric properties of the Holland sleep disorders questionnaire in the Iranian population. Sleep Disord. 2022;2022:1367067.
  37. Structural validity of the Insomnia severity index: A systematic review and meta-analysis. Sleep Med Rev. 2021;60:101531.
  38. Validation of the Insomnia Severity Index (ISI) for identifying insomnia in young adult cancer survivors: Comparison with a structured clinical diagnostic interview of the DSM-5 (SCID-5). Sleep Med. 2021;81:80-5.
  39. Pseudo-random number generator influences on average treatment effect estimates obtained with machine learning. Epidemiology. 2024;35:779-86.
  40. Challenges to the reproducibility of machine learning models in health care. JAMA. 2020;323:305-6.
  41. Big Data and machine learning in health care. JAMA. 2018;319:1317-8.
  42. Neural networks and deep learning: A brief introduction. Intensive Care Med. 2019;45:712-4.
  43. Quality of random number generators significantly affects results of Monte Carlo simulations for organic and biological systems. J Comput Chem. 2011;32:513-24.
  44. Reproducible machine learning research in mental workload classification using EEG. Front Neuroergon. 2024;5:1346794.
  45. Introduction to machine learning, neural networks, and deep learning. Transl Vis Sci Technol. 2020;9:14.
  46. The long but necessary road to responsible use of large language models in healthcare research. NPJ Digit Med. 2024;7:177.
  47. Ethical considerations in the use of artificial intelligence and machine learning in health care: A comprehensive review. Cureus. 2024;16:e62443.
  48. Ethical considerations for artificial intelligence in medical imaging: Data collection, development, and evaluation. J Nucl Med. 2023;64:1848-54.
  49. The ethics of disclosing the use of artificial intelligence tools in writing scholarly manuscripts. Res Ethics. 2023;19:449-65.
  50. Medical large language models are vulnerable to data-poisoning attacks. Nat Med. 2025;31:618-26.
  51. Ethical data acquisition for LLMs and AI algorithms in healthcare. NPJ Digit Med. 2024;7:377.
  52. Attention is not all you need: The complicated case of ethically using large language models in healthcare and medicine. EBioMedicine. 2023;90:104512.
  53. Bias in medical AI: Implications for clinical decision-making. PLOS Digit Health. 2024;3:e0000651.