Copilot Researcher - A better research assistant

Author: Raghavendra
Published: April 20, 2025

Introduction

In the age of information overload, researchers and knowledge workers are constantly battling the ever-growing ocean of academic papers, preprints, and literature. What if you could summon a digital assistant that fetches the most relevant open-access papers, extracts insights from them, helps you brainstorm using cutting-edge large language models, and even generates ready-to-use citations?

Say hello to ARA: Agentic Research Assistant, a smart, modular assistant built with LangChain, FAISS, and Google AI Studio. This assistant takes a simple query and scours the web for relevant papers, summarizes and indexes them into a searchable vector store, and uses Retrieval-Augmented Generation (RAG) to answer complex research questions with clarity and context.

Whether you’re exploring a new field or compiling a related work section, this assistant is designed to accelerate and enrich your workflow.

In this blog post, we’ll walk through how the assistant works, the tools it uses under the hood, and how you can easily adapt it to your own research needs.

Features of the Research Assistant Agent

At the heart of this project is a smart and modular agent that leverages a suite of custom tools — each designed to enhance the research workflow from search to citation. Here’s a breakdown of the key capabilities:

1. Neural Search of Open-Access Papers (Powered by Exa AI)

The agent can perform intelligent paper retrieval by querying Exa AI, a search engine that returns high-quality academic content via neural search. Given a topic, phrase, or even an initial PDF, the assistant uses this capability to:

  • Discover open-access research papers that are contextually relevant.
  • Return URLs, titles, and abstracts for each result.
  • Bootstrap the literature discovery process using state-of-the-art semantic retrieval.

This means you no longer have to sift through pages of generic results — your assistant surfaces the most meaningful papers, fast.
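
For illustration, here is a minimal sketch of this search step using the exa-py client. The query string, result count, and environment variable name are assumptions for the example, not values taken from the project:

# Minimal sketch of the paper-search step, assuming the exa-py client
# (pip install exa-py) and an Exa API key in the environment.
import os
from exa_py import Exa

exa = Exa(api_key=os.environ["EXA_API_KEY"])

# Neural search that also fetches page text, so abstracts and bodies
# are available for the chunking and indexing steps that follow.
response = exa.search_and_contents(
    "recent advances in AI for scientific discovery",
    num_results=5,
    text=True,
)

papers = response.results
for paper in papers:
    print(paper.title, paper.url)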


2. Chunking and Storing Papers in a Vector Store

Once the relevant papers are retrieved, the agent processes them into manageable chunks using LangChain’s document loaders and text splitters. These chunks are:

  • Embedded using Google’s Gemini Embeddings.
  • Stored efficiently in a FAISS vector store along with metadata (title, link, summary, etc.).
  • Organized so that future queries can retrieve the most relevant snippets instantly.

This setup enables lightning-fast retrieval and lays the foundation for contextual Q&A using Retrieval-Augmented Generation (RAG).
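
As a rough sketch of how these pieces fit together (the chunk sizes and embedding model name are illustrative choices, and papers is assumed to be the search results from the previous sketch):

# Sketch: wrap fetched papers as LangChain Documents, split them into
# chunks, embed with Gemini embeddings, and index them in FAISS.
# Assumes a GOOGLE_API_KEY environment variable is set.
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS

docs = [
    Document(page_content=p.text, metadata={"title": p.title, "link": p.url})
    for p in papers  # Exa results from the previous step
]

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(docs)  # metadata is carried into each chunk

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vector_store = FAISS.from_documents(chunks, embeddings)
vector_store.save_local("papers_index")  # persist the index for later sessions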


3. Retrieval from the Vector Store

When you ask the agent a question, it doesn’t just guess — it retrieves the most relevant pieces of text from your custom literature database. Here’s how:

  • The question is embedded using the same embedding model.
  • FAISS performs a similarity search across all stored document chunks.
  • The top-matching documents are returned as context to the LLM (Gemini, accessed via Google AI Studio) for a grounded and accurate response.

This RAG pipeline ensures that answers are always backed by real literature — reducing hallucination and increasing trust.
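
A minimal sketch of this retrieval step, continuing from the index built above (the value of k is an arbitrary choice):

# Sketch: search the FAISS index for the chunks most similar to a
# question; the same Gemini embedding model is applied under the hood.
question = "What are the most promising approaches for AI-assisted hypothesis generation?"

top_chunks = vector_store.similarity_search(question, k=4)
for chunk in top_chunks:
    print(chunk.metadata["title"], "->", chunk.page_content[:120])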


4. BibTeX Citation Generation

The final step is turning your discoveries into citable references. The agent includes a tool to:

  • Extract the arXiv IDs from stored documents.
  • Query services like arxiv2bibtex.org to fetch accurate BibTeX or BibLaTeX entries.
  • Automatically generate a .bib file that can be downloaded or plugged into your LaTeX document.

No more manually copying citations — your agent builds the bibliography for you.
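
A hedged sketch of this step follows. The query format expected by arxiv2bibtex.org is an assumption here (check the service's own form before relying on it), and the ID extraction assumes the paper links were stored as arXiv URLs:

# Sketch of the citation step. The arxiv2bibtex.org parameters are
# assumptions, and the response is an HTML page, so the BibTeX block
# still needs to be extracted before writing a clean .bib file.
import re
import requests

def extract_arxiv_ids(documents):
    # Pull arXiv IDs (e.g. 2103.00020) out of the stored paper links.
    ids = set()
    for doc in documents:
        match = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})",
                          doc.metadata.get("link", ""))
        if match:
            ids.add(match.group(1))
    return sorted(ids)

ids = extract_arxiv_ids(chunks)  # chunks from the indexing sketch above
resp = requests.get("https://arxiv2bibtex.org/", params={"q": " ".join(ids)})
with open("references.bib", "w") as f:
    f.write(resp.text)  # in practice, parse out just the BibTeX entries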


Demo Use Case: Exploring AI for Scientific Discovery

To demonstrate the capabilities of our research assistant, let’s walk through a real-world use case — exploring the role of AI in accelerating scientific discovery.

Step 1: Query the Assistant

We begin with a simple prompt:

“Recent advances in using AI for accelerating scientific research and discovery.”

The assistant sends this query to the Exa AI neural search engine, which returns a curated list of open-access papers from sources like arXiv, Semantic Scholar, and institutional repositories. Each paper includes metadata like title, link, and abstract.


Step 2: Store the Papers as a Vector Database

The retrieved documents are chunked into context-aware sections using LangChain’s RecursiveCharacterTextSplitter, and then embedded using Google Gemini models. These vectors are stored in a FAISS vector index, complete with rich metadata like:

  • Title and source link
  • Summary or abstract
  • Full body text (for RAG)

This makes the papers instantly searchable and ready for downstream question answering.
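
Because the index is saved to disk, a later session can reload it instead of re-embedding everything (a small sketch; the folder name and embeddings object match the earlier sketch):

# Sketch: reload the persisted FAISS index so the papers stay
# instantly searchable across sessions without re-embedding.
from langchain_community.vectorstores import FAISS

vector_store = FAISS.load_local(
    "papers_index",
    embeddings,  # the same Gemini embeddings object used at build time
    allow_dangerous_deserialization=True,  # required for pickle-backed indexes
)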


Step 3: Ask a Research Question

We now ask the assistant:

“What are the most promising approaches for AI-assisted hypothesis generation?”

The assistant performs a vector search over the stored documents and retrieves the most relevant chunks. These are passed as context to the Gemini model via Google AI Studio, which generates a grounded, coherent response based on actual content from the papers, not just its own pretraining.
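
A minimal sketch of this grounding step, reusing the question and retrieved chunks from above (the model name and prompt wording are illustrative):

# Sketch: hand the retrieved chunks to a Gemini chat model as context
# so the answer is grounded in the stored papers.
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

context = "\n\n".join(chunk.page_content for chunk in top_chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
answer = llm.invoke(prompt)
print(answer.content)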


Step 4: Generate BibTeX Citations

Finally, we generate citations for all the referenced documents:

  • The assistant extracts arXiv IDs and uses arxiv2bibtex.org to fetch BibTeX entries.
  • It compiles them into a downloadable .bib file — ready to plug into Overleaf or your LaTeX thesis.

Example output:

@article{deepmindAlphaFold,
  title={Highly accurate protein structure prediction with AlphaFold},
  author={Jumper, John and others},
  year={2021},
  journal={Nature},
  url={https://arxiv.org/abs/XXXX.XXXX}
}

Conclusion and Future Work

Conclusion

As demonstrated in the end-to-end walkthrough, the agent effectively combines tool-based reasoning with retrieval-augmented generation to streamline the research process. The execution traces show the agent calling each tool with a clear purpose, from searching papers to generating citations, and providing intelligent assistance at every step.

As a researcher, I have found this assistant incredibly valuable. Keeping track of relevant literature, notes, and citations can often feel overwhelming. With the sheer volume of scientific content published daily, it’s easy to lose sight of the main objective. This assistant helps realign focus by surfacing high-quality content and enabling rapid synthesis, whether for exploring new topics, cross-referencing facts, or drafting literature reviews.

The ability to interact with papers through natural language, receive contextual responses, and auto-generate BibTeX citations greatly accelerates the traditionally slow and manual aspects of academic writing. What would normally take hours of searching and organizing can now be achieved in minutes, a true leap forward for individual productivity and collaborative research.


Future Work

While the current system is already powerful, several exciting directions remain for future development:

  • Deep Document Understanding: Enhance the assistant’s capabilities to not just retrieve and summarize but also interpret mathematical formulations, diagrams, and results using multimodal models.

  • Zotero Integration: Building a bridge to tools like Zotero would enable seamless citation management, note-taking, and syncing across platforms — turning this assistant into a true extension of the researcher’s workflow.

  • Standalone Application: Packaging this notebook into a web-based application with persistent storage, user profiles, and plug-in support could bring this assistant to a broader audience. Researchers could have an always-on, personalized research collaborator.

  • Collaboration and Sharing: Enable users to share their research sessions, notes, and citations with peers — encouraging knowledge exchange and reproducibility in research.

By continuing to build on this foundation, we move closer to a world where every researcher has an intelligent partner that accelerates discovery and reduces the friction of scholarly work.

Code

The code and instructions for the project can be found in the following Kaggle Notebook: Notebook link