Active and Passive Crowdsourcing in Medical Research

With the rise of crowdsourcing and the increasing popularity of platforms that allow individuals to collaborate and share information (knowingly or unknowingly), it's important to consider how these tools can be leveraged to advance health and medicine. 

Crowdsourcing is used in medical research for clinical data collection, data analysis, and clinical trials, as well as to crowdfund biomedical research. The ‘crowd’, or large numbers of people, can contribute knowledge inputs (and expert inputs in particular) to solve complex problems. The extreme citizen science and quantified-self movements can provide valuable data and insights. However, since active participation is limited, passive methods are more commonly used.

Passive crowdsourcing refers to collecting data or information from individuals without their explicit knowledge or consent, typically through tracking technologies such as cookies or by analyzing data from social media or search engines. This type of crowdsourcing is becoming increasingly prevalent as more and more of our daily activities move online, and companies and organizations use the resulting data to gain insights and make decisions. Rigorous ethical and regulatory controls are needed to ensure that data are collected and analyzed appropriately. Ironically, existing processes mostly slow down beneficial research while failing to prevent "dark," "shadow," and "covert" crowdsourcing. Examples include CAPTCHAs, which make you work for others without realizing it; internet footprints that companies use to learn about users' behavior in order to make their products more addictive; and mobile data from cell phones used to analyze traffic and street topographies.
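
To make the passive approach concrete, here is a minimal sketch of the kind of analysis it enables: aggregating a search-engine query log into weekly counts of symptom-related searches, a signal that has been used for disease surveillance. The log format, file name, and keyword list are all illustrative assumptions, not a real API or dataset.

```python
from collections import Counter
from datetime import datetime
import csv

# Hypothetical log format: one CSV row per query, "timestamp,query".
# Real search logs are proprietary; this is illustrative only.
SYMPTOM_TERMS = {"fever", "cough", "sore throat", "loss of smell"}

def weekly_symptom_counts(log_path: str) -> Counter:
    """Count symptom-related queries per (year, ISO week)."""
    counts = Counter()
    with open(log_path, newline="") as f:
        for timestamp, query in csv.reader(f):
            if any(term in query.lower() for term in SYMPTOM_TERMS):
                week = datetime.fromisoformat(timestamp).isocalendar()[:2]
                counts[week] += 1
    return counts

# Spikes in such counts may precede spikes in clinical case reports:
# print(weekly_symptom_counts("queries.csv").most_common(5))
```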

The newly launched Med-PaLM, a large language model from Google Research and DeepMind aligned to the medical domain, is an example of how knowledge harvested from the crowd can be utilized in health research. It draws on frequently searched medical inquiries (Google's own HealthSearchQA dataset) and on the vast amount of question-answering data already available on the internet: MedQA, collected from professional medical board exams; MedMCQA, containing more than 194k high-quality AIIMS (All India Institute of Medical Sciences) and NEET PG entrance-exam multiple-choice questions covering 2.4k healthcare topics and 21 medical subjects; PubMedQA, collected from PubMed abstracts; LiveQA, with question-answer pairs from the US National Library of Medicine; MedicationQA, with question-answer pairs from trusted sources such as DailyMed, MedlinePlus, cdc.gov, mayoclinic.org, health.harvard.edu, and PubMed; and the clinical topics of MMLU.
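
As a rough illustration of how such crowd-derived benchmarks can be inspected, the sketch below loads two of the public datasets mentioned above and prints a question from each. The Hugging Face hub IDs ("medmcqa", and "pubmed_qa" with its "pqa_labeled" configuration) and the field names are assumptions based on the public dataset cards, not part of Med-PaLM itself.

```python
# A minimal sketch, assuming the datasets are mirrored on the
# Hugging Face hub under these IDs (pip install datasets).
from datasets import load_dataset

# MedMCQA: multiple-choice medical entrance-exam questions.
medmcqa = load_dataset("medmcqa", split="train")
q = medmcqa[0]
print(q["question"])
print([q["opa"], q["opb"], q["opc"], q["opd"]])  # the four answer options

# PubMedQA: yes/no/maybe research questions over PubMed abstracts.
pubmedqa = load_dataset("pubmed_qa", "pqa_labeled", split="train")
print(pubmedqa[0]["question"], "->", pubmedqa[0]["final_decision"])
```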

Crowdsourcing could result in high-quality outcomes, broad community engagement, and more open science. But the models and platforms currently in use are far from perfect, and there is still much work to be done to fully harness the power of the collective. 


REFERENCES

Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. PubMedQA: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146 (2019).

Tucker JD, Day S, Tang W, Bayus B. Crowdsourcing in medical research: concepts and applications. PeerJ. 2019 Apr 12;7:e6762.

Abacha, A. B., Agichtein, E., Pinter, Y. & Demner-Fushman, D. Overview of the medical question answering task at TREC 2017 LiveQA. in TREC (2017), 1–12.

Abacha, A. B., Mrabet, Y., Sharp, M., Goodwin, T. R., Shooshan, S. E. & Demner-Fushman, D. Bridging the Gap Between Consumers’ Medication Questions and Trusted Answers. in MedInfo (2019), 25–29.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. & Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020).

Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H. & Szolovits, P. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences 11, 6421 (2021).

Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. in Conference on Health, Inference, and Learning (2022), 248–260.

Wazny K. “Crowdsourcing” ten years in: A review. Journal of global health. 2017 Dec;7(2).

Wang C, Han L, Stein G, Day S, Bien-Gund C, Mathews A, Ong JJ, Zhao PZ, Wei SF, Walker J, Chou R. Crowdsourcing in health and medical research: a systematic review. Infectious diseases of poverty. 2020 Dec;9(1):1-9.

Tan RK, Wu D, Day S, Zhao Y, Larson HJ, Sylvia S, Tang W, Tucker JD. Digital approaches to enhancing community engagement in clinical trials. NPJ digital medicine. 2022 Mar 25;5(1):1-8.

Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P. Large Language Models Encode Clinical Knowledge. arXiv preprint arXiv:2212.13138. 2022 Dec 26.
