Methodology
How MindDance retrieves, ranks, tiers, and writes about AIDD papers
MindDance is a daily research brief for people working in AI-driven drug discovery. It is not built as a broad paper dump. The goal is to retrieve a larger pool of relevant AIDD papers, keep the selection logic visible, and publish concise, traceable commentary on the strongest items.
Positioning
Content is prioritized roughly as Drug > Chem ≈ Bio > Med. If a paper is genuinely useful for drug discovery, it can enter the pool whether it is framed as chemistry, biology, structure, methods, or translational work. Papers that are still pure AI, pure biology, pure chemistry, or pure physics are meant to be removed downstream.
The site borrows the transparency principle used by general AI briefing products, but adapts it to a much harder domain boundary: AIDD needs stronger filtering on relevance, not just community buzz.
Daily cadence
The pipeline is designed to run at 08:00 Beijing time, and the publish date is the run date. Paper dates follow a T+1 logic: each brief primarily covers papers dated yesterday in Beijing time, plus any relevant papers that upstream sources already expose when the run happens. In practice, coverage depends on how quickly each source indexes new material.
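The date logic above can be sketched as follows. This is a minimal illustration, not the site's actual scheduler: the function name `run_dates` is hypothetical, and a fixed UTC+8 offset stands in for Beijing time, which is a reasonable assumption because China does not currently observe daylight saving time.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: Beijing time modeled as a fixed UTC+8 offset (no DST in use).
BEIJING = timezone(timedelta(hours=8))


def run_dates(now: datetime) -> tuple[date, date]:
    """Return (publish_date, primary_paper_date) for a run at `now`.

    publish_date is the run date in Beijing time; primary_paper_date is
    the previous Beijing calendar day (the T+1 target).
    """
    if now.tzinfo is None:
        raise ValueError("`now` must be timezone-aware")
    local = now.astimezone(BEIJING)
    publish_date = local.date()
    primary_paper_date = publish_date - timedelta(days=1)
    return publish_date, primary_paper_date


if __name__ == "__main__":
    # A run at 08:00 Beijing time, expressed here as 00:00 UTC.
    now = datetime(2025, 3, 2, 0, 0, tzinfo=timezone.utc)
    publish_date, paper_date = run_dates(now)
    print(publish_date, paper_date)
```

The sketch only fixes the primary target day. Papers from other days that are already discoverable at run time would still need to be admitted by whatever source-specific query logic sits on top of it.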
Where papers come from
The current primary sources are arXiv, bioRxiv, and PubMed. These sources should be treated as a union of candidate inputs rather than an intersection.
- arXiv: captures q-bio core categories plus broader AI and physics-adjacent categories where AIDD methods often appear.
- bioRxiv: adds preprints from protein design, computational biology, biophysics, and pharmacology-oriented work.
- PubMed: currently the main path for journal-style AIDD retrieval, especially from medicinal chemistry, computational chemistry, structural biology, and computational biology journals.
- Auxiliary signals: citation, repository, and community indicators are used mainly for enrichment and ranking, not as the primary retrieval channel.
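Treating the sources as a union rather than an intersection could look like the sketch below. The `Candidate` type, the `dedup_key` rule (DOI if present, otherwise a normalized title), and the function names are all assumptions made for illustration; the source does not describe how duplicates across sources are actually resolved.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical minimal record for a retrieved paper."""
    source: str
    title: str
    doi: Optional[str] = None


def dedup_key(paper: Candidate) -> str:
    # Assumption: prefer the DOI; fall back to a whitespace-normalized,
    # lowercased title when no DOI is available.
    if paper.doi:
        return "doi:" + paper.doi.strip().lower()
    return "title:" + " ".join(paper.title.lower().split())


def union_candidates(*source_lists: Iterable[Candidate]) -> List[Candidate]:
    """Merge all sources, keeping the first record seen for each key."""
    merged = {}
    for papers in source_lists:
        for paper in papers:
            merged.setdefault(dedup_key(paper), paper)
    return list(merged.values())


if __name__ == "__main__":
    arxiv = [Candidate("arXiv", "Diffusion models for  ligand docking")]
    pubmed = [
        Candidate("PubMed", "diffusion models for ligand docking"),
        Candidate("PubMed", "ADMET prediction with graph networks"),
    ]
    for paper in union_candidates(arxiv, pubmed):
        print(paper.source, "|", paper.title)
```

A paper found by only one source still enters the pool, which is the point of a union. The order in which sources are passed decides which record wins when two sources return the same paper.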
Recall broadly, then filter in layers
Layer 1: rule-based screening
The first layer requires both an AI-method signal and an AIDD-domain signal. This layer is not supposed to decide the final editorial set by itself. Its job is to remove obvious noise while keeping the pool large enough for downstream scoring and LLM review.
The domain keywords are organized around the real AIDD workflow: target discovery, pockets and binding, docking and virtual screening, molecule generation and optimization, protein and antibody design, ADMET, synthesis, biomarkers, multi-omics, and translational relevance.
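The two-signal requirement can be illustrated with a small keyword screen. The term lists below are short, hypothetical samples chosen to echo the workflow areas named above; they are not the site's actual keyword sets, and plain substring matching is an assumed simplification.

```python
# Hypothetical sample terms; the real keyword lists are not given in the source.
AI_TERMS = (
    "deep learning",
    "machine learning",
    "neural network",
    "language model",
    "diffusion model",
)
DOMAIN_TERMS = (
    "docking",
    "virtual screening",
    "binding affinity",
    "admet",
    "antibody design",
    "molecule generation",
    "target discovery",
)


def passes_rule_screen(title: str, abstract: str) -> bool:
    """Keep a paper only if it shows BOTH an AI-method and an AIDD-domain signal."""
    text = f"{title} {abstract}".lower()
    has_ai = any(term in text for term in AI_TERMS)
    has_domain = any(term in text for term in DOMAIN_TERMS)
    return has_ai and has_domain


if __name__ == "__main__":
    print(passes_rule_screen("Deep learning for molecular docking", ""))
    print(passes_rule_screen("A neural network for image classification", ""))
```

A screen like this is deliberately permissive: it removes papers that lack one of the two signals entirely and leaves finer judgments to the later layers.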
Layer 2: multi-signal scoring
After rule-based inclusion, each paper is scored using signals that better match practitioner value than raw popularity. The most important signals currently include:
- Publication form and venue: journals often outrank preprints, and top journals or top conferences receive stronger weight.
- Institutional signal: leading academic labs, pharma AI teams, and recognized AIDD companies carry more weight.
- Code and reproducibility: publicly available code and repository evidence improve rank.
- Domain strength: whether the paper sits on the drug discovery path instead of merely touching biology or AI language.
- Community and citation signals: used as supplementary evidence, not the sole criterion.
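One simple way to combine signals like these is a weighted sum, sketched below. The weights, the 0-to-1 signal scale, and the names `Signals`, `WEIGHTS`, and `score_paper` are illustrative assumptions; the source lists the signals but does not state how they are weighted or combined. The only property taken from the text is that community and citation evidence stays supplementary, so it gets the smallest weight here.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical per-paper signals, each expected in the range 0.0 to 1.0."""
    venue: float
    institution: float
    code: float
    domain: float
    community: float


# Assumed weights for illustration only; they sum to 1.0.
WEIGHTS = {
    "venue": 0.25,
    "institution": 0.15,
    "code": 0.15,
    "domain": 0.35,
    "community": 0.10,
}


def score_paper(signals: Signals) -> float:
    """Weighted sum of signals, with each signal clamped to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(max(getattr(signals, name), 0.0), 1.0)
        total += weight * value
    return total


if __name__ == "__main__":
    buzzy = Signals(venue=0.2, institution=0.2, code=0.0, domain=0.1, community=1.0)
    solid = Signals(venue=0.6, institution=0.5, code=1.0, domain=0.9, community=0.1)
    print(score_paper(buzzy) < score_paper(solid))
```

With weights shaped like this, a paper cannot rank highly on community attention alone, which matches the stated preference for practitioner value over raw popularity.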
Layer 3: visible tiers instead of binary keep/drop
The site keeps three explicit tiers: featured, notable, and candidate.
This matters because hiding the candidate tier makes the product look far stricter than it really is, and removes the reader's ability to audit the daily pool.
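A tiering step that keeps every scored paper visible, rather than dropping low scorers, might look like the sketch below. It uses the tier names that appear in this document (featured, notable, candidate); the cut-off values and function names are hypothetical, since the source does not say how tiers are assigned.

```python
# Assumed cut-offs on a 0-to-1 score; purely illustrative.
FEATURED_MIN = 0.75
NOTABLE_MIN = 0.50


def assign_tier(score: float) -> str:
    """Map a score to a tier; nothing is dropped, low scores become candidates."""
    if score >= FEATURED_MIN:
        return "featured"
    if score >= NOTABLE_MIN:
        return "notable"
    return "candidate"


def group_by_tier(scored):
    """Group (title, score) pairs by tier, preserving input order within a tier."""
    tiers = {"featured": [], "notable": [], "candidate": []}
    for title, score in scored:
        tiers[assign_tier(score)].append(title)
    return tiers


if __name__ == "__main__":
    pool = [("Paper A", 0.82), ("Paper B", 0.55), ("Paper C", 0.31)]
    for tier, titles in group_by_tier(pool).items():
        print(tier, titles)
```

Because the lowest tier is returned alongside the others, a reader can still see what was retrieved but not highlighted.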
Layer 4: LLM judge as semantic cleanup
The LLM judge is a second-pass reviewer. It rechecks papers in the featured and notable tiers, and can also inspect high-scoring candidates. If a paper slipped through because of keyword overlap but is not genuinely AIDD, it should be pushed back down. If a semantically strong AIDD paper looked weaker in the heuristic stage, it can be promoted.
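The promote and demote behavior can be sketched independently of any particular model. In the example below, the judge's output is reduced to two booleans, and a paper moves at most one tier per review; both choices are assumptions made for illustration, and `apply_verdict` is a hypothetical name. No real LLM API is called.

```python
# Tiers ordered from lowest to highest.
TIERS = ["candidate", "notable", "featured"]


def apply_verdict(tier: str, is_genuine_aidd: bool, semantically_strong: bool) -> str:
    """Adjust a heuristic tier using a (stubbed) judge verdict.

    Assumption: papers judged not genuinely AIDD drop one tier; genuine
    papers judged semantically strong rise one tier; others stay put.
    """
    index = TIERS.index(tier)
    if not is_genuine_aidd:
        index = max(index - 1, 0)
    elif semantically_strong:
        index = min(index + 1, len(TIERS) - 1)
    return TIERS[index]


if __name__ == "__main__":
    # Keyword overlap only: pushed back down.
    print(apply_verdict("featured", is_genuine_aidd=False, semantically_strong=False))
    # Strong AIDD paper that scored low heuristically: promoted.
    print(apply_verdict("candidate", is_genuine_aidd=True, semantically_strong=True))
```

Keeping the verdict as plain data separates the tier logic from the prompt and model used to produce it, so either can change without touching the other.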
How the site presents this
Current AIDD topic map
Based on recent AIDD reviews and research patterns, the site is easier to understand through these workflow-oriented groups:
Known limitations
- Source breadth is still limited: retrieval is stronger than before, but still concentrated in arXiv, bioRxiv, and PubMed.
- Date semantics depend on upstream indexing: different APIs expose new papers at different speeds.
- Scoring and topic taxonomy are still evolving: AIDD boundaries are harder to formalize than generic AI news.
- Write-ups are abstract-driven: useful for fast review, but not a substitute for full-paper reading.
FAQ
- How is MindDance different from a generic paper index?
  Generic indexes answer "how do I find papers?" MindDance answers "which AIDD papers matter today, and why?" It is intentionally selective and optimized for practitioner-facing review rather than exhaustive coverage.
- Why expose the candidate tier on sources pages?
  Because transparency is part of the product. Showing candidates lets readers inspect whether the daily pool is too small, too broad, or mis-ranked instead of seeing only the final editorial output.
- What does the LLM judge actually do?
  It is a second semantic filter rather than a writing engine. Its job is to reject papers that still look like pure AI, pure biology, pure chemistry, or pure physics instead of true AIDD content.
- Why avoid first-person commentary in the write-ups?
  Because the site is structured as a research brief, not a personal essay column. The current tone is neutral and compact, centered on the problem, method, validation level, and practical significance.