Paper review
The review is due at 11:59pm on Monday, May 12, 2025. Please submit your reviews as a single PDF document on GradeScope.
We ask you to review two (2) papers of your choice from the following options:
- (Reasoning) Training Large Language Models to Reason in a Continuous Latent Space
- (Evaluation) EvalAgent: Discovering Implicit Evaluation Criteria from the Web
- (Agent) Gorilla: Large Language Model Connected with Massive APIs
- (Tool use) xLAM: A Family of Large Action Models to Empower AI Agent Systems
- (Deep research) DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Pretend that these papers have not yet been peer-reviewed and that you are the reviewer tasked with giving honest, critical feedback. Also be mindful of each paper's publication date: do not judge its novelty or performance against concurrent or more recent papers.
For more sample paper reviews, browse the OpenReview website at https://openreview.net. For example, sample reviews from the most recent NeurIPS conference are available at https://openreview.net/group?id=NeurIPS.cc/2024/Conference
Generative AI policy. You may not use generative AI tools in any way to write your paper reflections.
Review format
When reviewing each paper, please answer the following questions (adapted from the NeurIPS 2024 Reviewer Guidelines):
- Summary: Briefly summarize the paper and its contributions. What is the problem or question this paper addresses? What is the solution or approach? This is not the place to critique the paper; the authors should generally agree with a well-written summary.
- Strengths and Weaknesses: Please provide a thorough assessment of the strengths and weaknesses of the paper, touching on each of the following dimensions:
a. Originality: Are the tasks or methods new? Is the work a novel combination of well-known techniques? (This can be valuable!) Is it clear how this work differs from previous contributions? Is related work adequately cited?
b. Quality: Is the submission technically sound? Are claims well supported (e.g., by theoretical analysis or experimental results)? Are the methods used appropriate? Is this a complete piece of work or work in progress? Are the authors careful and honest about evaluating both the strengths and weaknesses of their work?
c. Clarity: Is the submission clearly written? Is it well organized? (If not, please make constructive suggestions for improving its clarity.) Does it adequately inform the reader? (Note that a superbly written paper provides enough information for an expert reader to reproduce its results.)
d. Significance: Are the results important? Are others (researchers or practitioners) likely to use the ideas or build on them? Does the submission address a difficult task in a better way than previous work? Does it advance the state of the art in a demonstrable way? Does it provide unique data, unique conclusions about existing data, or a unique theoretical or experimental approach?
- Questions: Please list and carefully describe any questions and suggestions for the authors. Focus on points where a response from the authors could change your opinion, clear up a confusion, or address a limitation. This can be very important for a productive rebuttal and discussion phase with the authors.
- Limitations: Have the authors adequately addressed the limitations and potential negative societal impact of their work? If not, please include constructive suggestions for improvement. In general, authors should be rewarded rather than punished for being up front about the limitations of their work and any potential negative societal impact. You are encouraged to think through whether any critical points are missing and provide these as feedback for the authors.
- Ethical concerns: If there are ethical issues with this paper, please flag them.
- Soundness: Please assign the paper a numerical rating on the following scale to indicate the soundness of the technical claims, experimental and research methodology and on whether the central claims of the paper are adequately supported with evidence.
- 4 excellent
- 3 good
- 2 fair
- 1 poor
- Presentation: Please assign the paper a numerical rating on the following scale to indicate the quality of the presentation. This should take into account the writing style and clarity, as well as contextualization relative to prior work.
- 4 excellent
- 3 good
- 2 fair
- 1 poor
- Contribution: Please assign the paper a numerical rating on the following scale to indicate the quality of the overall contribution this paper makes to the research area being studied. Are the questions being asked important? Does the paper bring significant originality of ideas and/or execution? Are the results valuable to share with the broader AI community?
- 4 excellent
- 3 good
- 2 fair
- 1 poor
- Overall: Please provide an “overall score” for this submission. Choices:
- 10: Award quality: Technically flawless paper with groundbreaking impact on one or more areas of AI, with exceptionally strong evaluation, reproducibility, and resources, and no unaddressed ethical considerations.
- 9: Very Strong Accept: Technically flawless paper with groundbreaking impact on at least one area of AI and excellent impact on multiple areas of AI, with flawless evaluation, resources, and reproducibility, and no unaddressed ethical considerations.
- 8: Strong Accept: Technically strong paper with novel ideas, excellent impact on at least one area of AI or high-to-excellent impact on multiple areas of AI, with excellent evaluation, resources, and reproducibility, and no unaddressed ethical considerations.
- 7: Accept: Technically solid paper, with high impact on at least one sub-area of AI or moderate-to-high impact on more than one area of AI, with good-to-excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations.
- 6: Weak Accept: Technically solid, moderate-to-high impact paper, with no major concerns with respect to evaluation, resources, reproducibility, ethical considerations.
- 5: Borderline accept: Technically solid paper where reasons to accept outweigh reasons to reject, e.g., limited evaluation. Please use sparingly.
- 4: Borderline reject: Technically solid paper where reasons to reject, e.g., limited evaluation, outweigh reasons to accept, e.g., good evaluation. Please use sparingly.
- 3: Reject: For instance, a paper with technical flaws, weak evaluation, inadequate reproducibility and incompletely addressed ethical considerations.
- 2: Strong Reject: For instance, a paper with major technical flaws, and/or poor evaluation, limited impact, poor reproducibility and mostly unaddressed ethical considerations.
- 1: Very Strong Reject: For instance, a paper with trivial results or unaddressed ethical considerations.
- Confidence: Please provide a “confidence score” for your assessment of this submission to indicate how confident you are in your evaluation. Choices:
- 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
- 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
- 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
- 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
- 1: Your assessment is an educated guess. The submission is not in your area or the submission was difficult to understand. Math/other details were not carefully checked.