Notes from an AI-Led Research Project on Pharmacovigilance

"Reviewer #2", the picky, ever-unreasonable reviewer, is a famous inside joke in academic publishing. But lately, every review cycle at machine learning conferences (e.g. ICML, NeurIPS) brings a new target for grumbling: the dreaded review generated by a large language model (LLM).

ICML program chairs speculating about LLM-generated reviews on LinkedIn — Program chairs of this year's ICML conference speculate whether an author has been generating reviews with LLMs. Source: LinkedIn

Despite the outcry every single review cycle, authors are also increasingly relying on AI tools themselves for their research. My PhD peers use Cursor to write entire code bases, Overleaf provides a wide set of re-write and summarization tools, and in one Slack channel I'm part of, a student even suggested starting a project by first asking ChatGPT: "Is this idea novel?".

Overleaf's AI writing suggestions feature — Overleaf's feature for academic writing suggestions. Source: Writefull.

While the research community has largely focused on how to regulate the use of AI tools, the recent Agents4Science conference took a completely different approach and asked: what happens in the extreme? If AI systems both lead and review research papers, can interesting research results emerge? To answer this, the conference required authors to use AI throughout their research project. Each paper was then reviewed by a set of AI reviewers.

Visualization of all Agents4Science submissions — A visualization tool for all submissions to Agents4Science, including Echo.

My submission, Echo: A multi-agent AI system for patient-centered pharmacovigilance, was selected as a spotlight. The overall acceptance rate was 18.9%, which is quite competitive.¹ And despite not feeling too attached to the project itself (I only set aside 6 hours to work on it, and heavily relied on Claude Sonnet 3.5), I still felt excitement upon receiving the acceptance notification. Why did I feel ownership over something that an AI system largely created?

The answer, I think, is that I had still invested energy into creative work — not writing or coding, but in steering. Throughout the project, I found myself playing a strange kind of role: part PI, part prompt engineer. I had to revise, re-try, and re-write entire paragraphs, experiencing the same waves of satisfaction and frustration that accompany "normal" research projects.

In this post, I share what it actually felt like to have an AI lead each step of a research project, including:

The Process — When I chose to "give Claude the wheel", when I had to take it back, and the eventual division-of-labor that emerged.
The Submission — The eventual result and feedback, and why the conference's "AI involvement checklist" is limited.
The Takeaways — How I now wish to use AI in my research projects in the future, and why I believe that research will remain a very human endeavor, even if our tools keep improving.

The Process

My project for Agents4Science was Echo, a multi-agent tool for identifying novel adverse drug reactions (ADRs) from social media forums. This can be useful for pharmacovigilance, as drug safety warnings typically rely on clinical trials or electronic health records, which may not be timely or robust.

I created a time-box of 6 hours for this project. I also didn't spend much time exploring different language models. For no real reason,² I typically use ChatGPT for quick questions and Claude Sonnet 3.5 for more complex writing and coding tasks, and stuck with the latter here.

Overview diagram of Echo's multi-agent system — Overview of Echo, a multi-agent AI system for patient-centered pharmacovigilance I built in collaboration with Claude Sonnet 3.5.

Step 1: Brainstorming the Research Idea

The idea behind Echo was 100% human, driven by my personal experiences. Oncology patients often have to take many drugs concurrently, and often turn to online support groups for answers to everyday questions about side effects (how long after a chemo infusion does the metallic taste last? did you ever have arms tingling after radiation? why is my hair now curly?). These support groups are wonderful spaces, and contain rich information that I believe goes beyond what doctors might be able to keep track of. One public example is the r/cancer subreddit.

Therefore, I was personally passionate about my original idea: can we identify drug safety signals from social media before official reporting? While I didn't love the idea of using Reddit data³, it seemed the easiest place to start for a proof-of-concept. At this stage, I prompted Claude only for some validation and "spark":

Initial brainstorming conversation with Claude about Echo — Initial brainstorming with Claude Sonnet 3.5 on the idea behind Echo.

AI Helpfulness

Claude's response motivated me to continue. The proposed building blocks seemed reasonable, and after some back-and-forth I settled on 4 agents: Explorer, Analyzer, Verifier, and Proposer.

Step 2: Building the Explorer Agent

The purpose of the Explorer agent is to identify (drug, side-effect) pairs across Reddit posts, while handling patients' informal language. Simply asking Claude Sonnet 3.5 to generate code for this agent resulted in a pretty good baseline:

Initial code for the Explorer agent to crawl Reddit posts in order to identify novel adverse drug reactions (left) using a language model-based extraction method (right).

Given my 6-hour limit, I appreciated not having to put in cognitive effort to think about the code's skeleton. However, after going through the effort of setting up Reddit credentials, I realized a big flaw in Claude's choice to use PRAW: it can only access the most recent 1000 posts on a subreddit! Part of my goal was to show that Echo can discover ADRs not reported in the literature, and I knew I'd need to prove this using older posts (e.g. from 2017). Sadly, Claude couldn't find a solution to this.

AI Helpfulness

Frustrated by a dead-end from Claude's suggestion, I completed the Explorer agent myself — finding a hacky SerpAPI workaround and editing the extraction prompt to include temporal proximity and community support signals.

Example of Explorer agent extracting a patient's dental sensitivity — An example of Echo's Explorer agent extracting a patient's dental sensitivity while taking the chemotherapy drug oxaliplatin.
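To make the Explorer's output concrete, here is a minimal sketch of how its extractions could be represented and parsed, assuming the language model is asked to reply with a JSON list. Only the drug_canonical field name comes from the actual project; every other field name, the 0 to 1 scale, and the parse_extraction helper are hypothetical and shown for illustration only.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractedPair:
    """One (drug, side-effect) mention pulled from a single post."""
    drug_mention: str            # drug name as the patient wrote it (hypothetical field)
    drug_canonical: str          # normalized drug name (field used in Echo)
    side_effect: str             # hypothetical field name
    temporal_proximity: float    # assumed 0-1 score assigned by the model
    patient_confidence: float    # assumed 0-1 score assigned by the model
    community_engagement: float  # assumed 0-1 score assigned by the model
    quote: str                   # supporting patient quote (hypothetical field)


def parse_extraction(llm_output: str) -> list[ExtractedPair]:
    """Parse the model's JSON reply, dropping items whose fields don't match."""
    pairs = []
    for item in json.loads(llm_output):
        try:
            pairs.append(ExtractedPair(**item))
        except TypeError:
            # The model omitted or invented a field, so skip this item.
            continue
    return pairs
```

Dropping malformed items silently keeps the sketch short; a real pipeline would probably want to log them, since they hint at prompt problems.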

Step 3: Finding Confounding Variables with the Analyzer Agent

The next step was to build the Analyzer agent, which combines all (drug, side-effect) pairs discovered by the Explorer agent, and creates summary statistics of 3 metrics: temporal proximity, patient confidence, and community engagement. These metrics were output by the Explorer agent, but if I had more time I'd have liked to do something more rigorous (e.g. use Reddit post upvotes for engagement).
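As a rough illustration of this aggregation step, the sketch below groups mentions by (drug, side-effect) pair and averages each of the three metrics. It is not the script Claude generated: the dictionary keys, the choice of the mean as the summary statistic, and the summarize function are all assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical key names for the three metrics emitted by the Explorer agent.
METRICS = ("temporal_proximity", "patient_confidence", "community_engagement")


def summarize(mentions):
    """Group mentions by (drug, side effect) and average each metric."""
    grouped = defaultdict(list)
    for m in mentions:
        grouped[(m["drug_canonical"], m["side_effect"])].append(m)

    summary = {}
    for pair, rows in grouped.items():
        stats = {"n_mentions": len(rows)}
        for metric in METRICS:
            stats[f"mean_{metric}"] = mean(r[metric] for r in rows)
        summary[pair] = stats
    return summary
```

Keying on the canonical drug name means that informal spellings of the same drug end up in one group, which is the main reason normalization has to happen before this step.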

Example confounding variables identified by the Analyzer agent for adverse drug reactions discovered from Reddit data. Source: Paper

AI Helpfulness

Claude easily generated the Analyzer script, and it was easy to add a confounders feature. But I really wish I could have done something more rigorous, such as creating an underlying causal graph from unstructured text.⁵

Step 4: Evaluations with the Verifier Agent

By the 3-hour mark, it was time to work on evaluating Echo. From Claude, I learned about the FDA's Adverse Event Reporting System (FAERS), which seemed like a reasonable ground truth database. I then asked Claude to write a Python function that takes in a (drug, side-effect) pair and year as inputs, and returns the number of corresponding instances up until the target year.
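A function of this shape might look like the sketch below. To stay self-contained it counts over an in-memory list of already-downloaded reports rather than querying FAERS itself; the FaersReport record, its field names, and the choice to treat the target year as inclusive are assumptions, not the code used in Echo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaersReport:
    """Hypothetical, simplified view of one adverse event report."""
    drug: str      # canonical drug name
    reaction: str  # reported reaction term
    year: int      # year the report was received


def count_reports(reports, drug, side_effect, target_year):
    """Count reports for the pair received in or before target_year."""
    drug, side_effect = drug.lower(), side_effect.lower()
    return sum(
        1
        for r in reports
        if r.drug.lower() == drug
        and r.reaction.lower() == side_effect
        and r.year <= target_year
    )
```

Exact string matching is the weak point here: patient language and FAERS terminology rarely line up, so both sides need to be normalized before counting.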

At this point, I realized it was pretty useful that Claude had included a drug_canonical field in the Explorer agent's code all on its own: it's hard to automatically match canonical drug names without leaving it to LLM magic. We were now able to label all (drug, side-effect) pairs with FAERS counts, and then calculate a Verifier score:

score = ln(FAERS_count + 1)
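The score above can be sketched in a few lines of Python. The function names are hypothetical; the only behavior taken from the formula is that a pair with zero FAERS reports gets a score of exactly 0, and that the score grows logarithmically with the report count.

```python
import math


def verifier_score(faers_count: int) -> float:
    """Return ln(FAERS_count + 1) for a (drug, side-effect) pair."""
    if faers_count < 0:
        raise ValueError("FAERS count cannot be negative")
    return math.log1p(faers_count)


def has_zero_score(faers_count: int) -> bool:
    """True when the pair has no FAERS reports at all."""
    return verifier_score(faers_count) == 0.0
```

The logarithm compresses the range, so a pair with thousands of reports does not dwarf one with a handful, while the zero case stays cleanly separated.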

Adverse drug reactions with a score of 0 were considered novel, but the work now fell on me to sanity-check whether these findings made sense! Some ADRs discovered by Echo with a score of 0, along with patient quotes, were:

therapeutic effect on multiple sclerosis, chemotherapy ("my neurologist just keeps saying, well chemo will treat your MS too")
deep vein thrombosis, abraxane ("she had folfirinox and then abraxane…a week later she had a stroke/DVTs")
hepatotoxicity, ivermectin ("My uncle tried it and badly damaged his liver. It almost killed him.")

Unfortunately, a quick Google search showed that many of the associations given a score of 0 are already known to the medical community. Claude was no longer helpful in automating this part of the evaluation. For the paper, I reported 5 ADRs that still appeared novel after I spent a lot of time manually digging. With more time, I would have spoken to a researcher in the medical community.

5 potentially novel drug-side effect associations discovered by Echo — 5 potentially novel (drug, side-effect) associations discovered by Echo from Reddit data. Source: Paper

AI Helpfulness

The only useful thing Claude provided was a pointer to the FAERS database. The actual hard research questions — what evaluation makes sense, quantifying statistical signals — required deeper thinking. And I wanted to do that thinking. That's the joy of research!

Step 5: Case Studies and Literature Search

The final set of results I wanted to include was a group of case studies showing that if we restricted Echo to only look at Reddit posts before a certain year (e.g. 2017), it could identify ADRs not yet reported in the literature. I first used Claude to brainstorm suggestions:

Claude suggesting case studies for ADRs discovered by Echo — Suggested case studies from Claude of ADRs discovered by Echo that might have preceded any FDA warnings.

While the gold stars in Claude's response created a sense of excitement, I quickly realized that many of these were not suitable case studies. For example, physicians had been aware of CAR-T CRS (cytokine release syndrome) since the start of clinical trials in 2017. I also spent a lot of time trying to verify when a side-effect was truly first documented for a drug. I eventually reported the following case studies:

Pneumonitis: (nivolumab, pneumonitis) and (pembrolizumab, pneumonitis)
Neuromuscular complications: (nivolumab, myasthenia gravis-like autoimmune neuropathy)
Hepatotoxicity: (regorafenib, hepatotoxicity)

AI Helpfulness

Claude was helpful for brainstorming but didn't have the long-context sensitivity to accurately identify if an ADR in our data was reported before its appearance in the literature.

Step 6: Designing a User Interface

My original vision was that Echo might eventually become an interactive web interface that lets a drug safety expert quickly discover potential side-effects of a drug. I asked Claude to create a clean interface that takes the aggregated data from Echo as a JSON file:

The interactive interface created by Claude for Echo — The interface created by Claude, run as a Python script on the command line, for visualizing and interacting with data from Echo.

AI Helpfulness

I was blown away by the interface Claude created. Designing clean interfaces has always been frustrating, but with a bit of "thinking time" Claude produced something really great. My faith in AI as a research companion was somewhat rejuvenated.

The Submission

At the end of the 6 hours, I was pretty exhausted and dissatisfied with the final submission. Prompting Claude repeatedly just isn't fun.

The Writing

I used Claude to write most sections of the paper given our results. During this last stage, I discovered a highly relevant paper that Claude never suggested, even when prompted for related work! Ransohoff et al. had shown in 2018 that a deep-learning system could identify skin adverse reactions from chemotherapy drugs before they were reported in the literature. The primary difference was our ability to consider a larger scope and more diverse kinds of patient language. If I had done my due diligence, this paper surely would have come up: it is published, well-cited, extremely relevant, and even written by Stanford affiliates I know. How did Claude miss this? Was it my prompt?

The Checklist

Agents4Science asked all authors to report how they used AI in a checklist at the end of the paper, which they then used to assign "autonomy scores" to each submission (mine was 7, two outstanding papers were 9!).

But I don't think the checklist captures the full extent of how I used AI. For example, I often brainstormed by pressing the "retry" button on the Claude UI several times to explore different answers to the same question. For writing, I didn't just ask Claude to write the full paper, but provided guidance (e.g. "make this sound more academic") via back-and-forth dialogue.

The AI Reviews

Surprisingly, I found the AI-generated reviews helpful and encouraging. Negative feedback included:

Subjective, unspecified scoring for temporal, confidence, and community metrics — I agree with this feedback, and I think it is an interesting trade-off between using an LLM vs. a potentially underspecified formula.
The possibility that some "novel" associations are known but not well-documented — I agree with this feedback, and really want to now reach out to medical experts.
The dataset is extremely small (187 posts) — While our eventual Reddit dataset was small due to budget constraints with SerpAPI, I realized I viewed this positively: from only 187 posts, Echo identified 640 ADRs! This feedback pushed me to clarify the framing of our initial dataset.

One AI reviewer (Gemini?) gave strong scores to every submission. Interestingly, even knowing this didn't stop me from feeling a bit of an ego boost upon reading "The paper is of exceptionally high quality." :)

The Takeaways

Despite AI handling much of my submission, the most valuable parts of research for me (creative direction, evaluation, and thinking) remained fundamentally human tasks. I realized:

AI is best for acceleration, not direction — Claude excelled at generating code skeletons, creating interfaces, and drafting text. It struggled with decisions like designing appropriate evaluation metrics or identifying truly novel findings, which all take a significant amount of time.
AI tools miss critical context — I'm still annoyed that Claude didn't suggest an important related work (Ransohoff et al., 2018), despite creating a whole literature review section. I now value doing a proper literature search myself, using AI to only partially help with discovery.
Effective human-AI collaboration requires new skills — I ended up having to learn when to "give Claude the wheel" versus taking back control. Learning when to trust AI tools is starting to feel like its own skill, but it's hard to think about how to teach this when the models keep changing.
The joy of research is in the thinking — Even with AI assistance, I realized how much I wanted to engage more deeply with the work. I wish I had given myself more time to read papers, analyze data, and consult experts.
AI-generated reviews are helpful — We complain a lot about LLM-generated reviews, but I actually do think they serve as a strong first-pass feedback with actionable suggestions to improve the paper.

Conclusion

One aspect that the "AI replacing researchers" hype misses: research is satisfying because we're curious and imagine questions we want to answer, not just because we want papers accepted. Collaborating with Claude felt like mentoring (via text) a very fast and smart student who lacked any intrinsic motivation of their own. As I start to think about whether or not I would enjoy staying in academia, I wonder whether such collaborations would feel meaningful.

Comic about AI replacing humans — Source: Instagram

¹ For reference, the acceptance rate for NeurIPS 2025, the biggest conference in the machine learning and AI community, was ~25%.
² This is what many people in Silicon Valley refer to as "vibes".
³ Reddit has pretty stringent policies in place to protect users' privacy, and I don't believe Echo could be deployed using Reddit data without some kind of permission.
⁴ https://www.seangoedecke.com/ai-sycophancy/
⁵ https://arxiv.org/abs/2407.07018