Yoonjeon Kim

Selected Research

Under Review 2026

Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

A diffusion-based RL framework for interleaved visual-textual reasoning. Localized visual editing reduces GRPO rollout computation by 26.9%; factorized reward assignment resolves cross-modal credit assignment, yielding 38% gains over SFT.

Yoonjeon Kim, Yuhta Takida, Chieh-Hsin Lai, Eunho Yang, Yuki Mitsufuji

paper

Unified multimodal foundation models have enabled both visual understanding and generation within a single framework. Building on this foundation, supervised fine-tuning and reinforcement learning have been employed to facilitate interleaved visual and textual thinking, tightly coupling image generation with textual reasoning. Existing works predominantly rely on autoregressive unified models as the backbone, which incurs substantial computational overhead due to full regeneration of image token sequences at every rollout step. In this work, we instead leverage multimodal discrete diffusion models to develop a reinforcement learning framework for interleaved reasoning. By exploiting bidirectional context modeling, our approach enables localized visual editing, allowing targeted modifications and reducing rollout computation during GRPO by 26.9% compared to full-image editing baselines, with only a minimal performance drop. However, bidirectional multimodal decoding in discrete diffusion models introduces a non-trivial challenge: rewards become implicitly coupled across interleaved image and text tokens, leading to spurious cross-modal credit assignment. To address this, we propose factorized reward assignment across text and vision streams, assigning rewards to their corresponding token segments for stable credit propagation. This yields 11.2% gains over joint reward assignment baselines and 38.04% improvements over the supervised fine-tuned model.
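
As a rough illustration of the factorized reward idea described above, the sketch below computes GRPO-style group-relative advantages separately for the text and image streams and assigns each token the advantage of its own modality. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the `"text"`/`"image"` mask labels, and the use of one scalar reward per modality per rollout are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def factorized_advantages(text_rewards, image_rewards, modality_masks):
    """Per-token advantages where each token only sees its own modality's reward.

    text_rewards[i] and image_rewards[i] are scalar rewards for rollout i
    (assumed inputs). modality_masks[i][t] is "text" or "image" for token t
    of rollout i (hypothetical labeling).
    """
    adv_text = group_relative(text_rewards)
    adv_image = group_relative(image_rewards)
    return [
        [adv_text[i] if m == "text" else adv_image[i] for m in mask]
        for i, mask in enumerate(modality_masks)
    ]


# Two rollouts: the first has the better text, the second the better image.
adv = factorized_advantages(
    text_rewards=[1.0, 0.0],
    image_rewards=[0.0, 1.0],
    modality_masks=[["text", "image", "text"], ["text", "image", "image"]],
)
```

In this toy group, the first rollout's text tokens receive a positive advantage while its image token receives a negative one. A joint scheme that sums both rewards would give every token in both rollouts the same (zero) advantage.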

ICML 2026

Verifying Meta-Awareness via Predictive Rewards in Reasoning Models

Meta-awareness objectives let reasoning models self-predict rollout statistics — length, pass-rate, and concepts used — enabling 83% accuracy gains on AIME25 and a 1.28× GRPO training speedup.

Yoonjeon Kim*, Doohyuk Jang*, Eunho Yang · *Equal Contribution

paper code

Recent research on reasoning models explores the meta-awareness of language models, including their ability to determine optimal thinking duration, recognize knowledge boundaries, and structure concept-level thinking. While current large reasoning models depend solely on answer-based verification, we show that adding meta-awareness objectives leads to significant performance gains over models without such meta-knowledge. Our method, MAPR, uses a self-generated task of predicting rollout statistics (specifically length, pass-rate, and concepts used), which allows the predictions to be verified against the actual statistics. Furthermore, by leveraging this self-predictive capability, the model can regulate its reasoning behavior by (i) filtering out trivial or unsolvable prompts, (ii) reducing lengthy generations that tend to be incorrect, and (iii) generating hints relevant to the problem. MAPR yields significant improvements in both accuracy and training efficiency on various reasoning benchmarks: a 1.28× GRPO training speedup, an 83.18% accuracy gain on AIME25, and a 13.04% average gain over six mathematics benchmarks.
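
To make the notion of a verifiable meta-prediction concrete, the sketch below scores a predicted (length, pass-rate, concepts) triple against statistics measured from a group of rollouts. It is only an illustration under assumptions: the dictionary keys, the 20% length tolerance, the Jaccard overlap for concepts, and the equal weighting are hypothetical choices, not the reward used in the paper.

```python
def rollout_statistics(rollouts):
    """Summarize a non-empty group of rollouts.

    Each rollout is a dict with hypothetical keys: "tokens" (int),
    "correct" (bool), and "concepts" (set of str).
    """
    n = len(rollouts)
    return {
        "length": sum(r["tokens"] for r in rollouts) / n,
        "pass_rate": sum(r["correct"] for r in rollouts) / n,
        "concepts": set().union(*(r["concepts"] for r in rollouts)),
    }


def meta_prediction_reward(predicted, actual, length_tolerance=0.2):
    """Average of three agreement scores in [0, 1] (assumed equal weights)."""
    # Length: 1 if the prediction is within a relative tolerance, else 0.
    length_ok = abs(predicted["length"] - actual["length"]) <= length_tolerance * actual["length"]
    # Pass rate: 1 minus the absolute error.
    pass_rate_score = 1.0 - abs(predicted["pass_rate"] - actual["pass_rate"])
    # Concepts: Jaccard overlap between predicted and observed concept sets.
    union = predicted["concepts"] | actual["concepts"]
    concept_score = len(predicted["concepts"] & actual["concepts"]) / len(union) if union else 1.0
    return (float(length_ok) + pass_rate_score + concept_score) / 3.0


rollouts = [
    {"tokens": 100, "correct": True, "concepts": {"induction"}},
    {"tokens": 300, "correct": False, "concepts": {"parity"}},
]
actual = rollout_statistics(rollouts)
predicted = {"length": 210, "pass_rate": 0.5, "concepts": {"induction"}}
reward = meta_prediction_reward(predicted, actual)
```

Because the targets are computed from the model's own rollouts, a reward of this kind needs no extra labels beyond the answer checker that already marks rollouts as correct.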

BibTeX

@inproceedings{
anonymous2026verifying,
title={Verifying Meta-Awareness via Predictive Rewards in Reasoning Models},
author={Yoonjeon Kim and Doohyuk Jang and Eunho Yang},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=Vl3tXPbjSH}
}

CVPR 2025

Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing

AugCLIP derives an ideal-edit CLIP representation using an MLLM to adaptively coordinate preservation and modification — outperforming prior metrics across five benchmarks and aligning closely with human judgment.

Yoonjeon Kim*, Soohyun Ryu*, Yeonsung Jung, Hyunkoo Lee, Joowon Kim, June Yong Yang, Jaeryong Hwang, Eunho Yang · *Equal Contribution

paper code project

The development of vision-language and generative models has significantly advanced text-guided image editing, which seeks the preservation of core elements in the source image while implementing modifications based on the target text. However, existing metrics have a context-blindness problem, indiscriminately applying the same evaluation criteria on completely different pairs of source image and target text, biasing towards either modification or preservation. Directional CLIP similarity, the only metric that considers both source image and target text, is also biased towards modification aspects. We propose AugCLIP, a context-aware metric that adaptively coordinates preservation and modification aspects depending on the specific context. This is done by deriving the CLIP representation of an ideally edited image using a multi-modal large language model to augment textual descriptions, then calculating a modification vector through a hyperplane that separates source and target attributes in CLIP space. Extensive experiments on five benchmark datasets show that AugCLIP aligns remarkably well with human evaluation standards, outperforming existing metrics.
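
For reference, the directional CLIP similarity mentioned above is commonly computed as the cosine between the image-embedding change and the text-embedding change. The sketch below shows that baseline metric only, not AugCLIP, and uses plain lists as stand-ins for CLIP embeddings; the function names and toy vectors are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def directional_similarity(src_img, edit_img, src_txt, tgt_txt):
    """Cosine between the edit direction in image space and in text space.

    All four arguments are placeholder embedding vectors; in practice they
    would come from CLIP's image and text encoders.
    """
    d_img = [e - s for e, s in zip(edit_img, src_img)]
    d_txt = [t - s for t, s in zip(tgt_txt, src_txt)]
    return cosine(d_img, d_txt)


# Toy 2-D embeddings: the image moved in exactly the direction the text asks for.
score = directional_similarity(
    src_img=[1.0, 0.0], edit_img=[1.0, 1.0],
    src_txt=[0.0, 0.0], tgt_txt=[0.0, 2.0],
)
```

The score depends only on the direction of change, so how much of the source image is preserved does not enter the computation directly.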

BibTeX

@InProceedings{Kim_2025_CVPR,
    author    = {Kim, Yoonjeon and Ryu, Soohyun and Jung, Yeonsung and Lee, Hyunkoo and Kim, Joowon and Yang, June Yong and Hwang, Jaeryong and Yang, Eunho},
    title     = {Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {23474-23483}
}

News

May 2026 Paper accepted at ICML 2026 — Verifying Meta-Awareness via Predictive Rewards in Reasoning Models

Jan 2026 Started Research Internship at SONY Research, Tokyo — mentored by Yuhta Takida and Chieh-Hsin (Jesse) Lai

Feb 2025 Paper accepted at CVPR 2025 — Preserve or Modify?

2024 Invited talk at AI Technology Seminar hosted by KAIST AI Graduate School

Publications

Verifying Meta-Awareness via Predictive Rewards in Reasoning Models

Yoonjeon Kim*, Doohyuk Jang*, Eunho Yang · *Equal Contribution

ICML 2026 paper code

Recent research on reasoning models explores the meta-awareness of language models, including their ability to determine optimal thinking duration, recognize knowledge boundaries, and structure concept-level thinking. While current large reasoning models depend solely on answer-based verification, we show that adding meta-awareness objectives leads to significant performance gains over models without such meta-knowledge. Our method, MAPR, uses a self-generated task of predicting rollout statistics (specifically length, pass-rate, and concepts used), which allows the predictions to be verified against the actual statistics. Furthermore, by leveraging this self-predictive capability, the model can regulate its reasoning behavior by (i) filtering out trivial or unsolvable prompts, (ii) reducing lengthy generations that tend to be incorrect, and (iii) generating hints relevant to the problem. MAPR yields significant improvements in both accuracy and training efficiency on various reasoning benchmarks: a 1.28× GRPO training speedup, an 83.18% accuracy gain on AIME25, and a 13.04% average gain over six mathematics benchmarks.

BibTeX

@inproceedings{
anonymous2026verifying,
title={Verifying Meta-Awareness via Predictive Rewards in Reasoning Models},
author={Yoonjeon Kim and Doohyuk Jang and Eunho Yang},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=Vl3tXPbjSH}
}

Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

Yoonjeon Kim, Yuhta Takida, Chieh-Hsin Lai, Eunho Yang, Yuki Mitsufuji

Under Review paper

Unified multimodal foundation models have enabled both visual understanding and generation within a single framework. Building on this foundation, supervised fine-tuning and reinforcement learning have been employed to facilitate interleaved visual and textual thinking, tightly coupling image generation with textual reasoning. Existing works predominantly rely on autoregressive unified models as the backbone, which incurs substantial computational overhead due to full regeneration of image token sequences at every rollout step. In this work, we instead leverage multimodal discrete diffusion models to develop a reinforcement learning framework for interleaved reasoning. By exploiting bidirectional context modeling, our approach enables localized visual editing, allowing targeted modifications and reducing rollout computation during GRPO by 26.9% compared to full-image editing baselines. We propose factorized reward assignment across text and vision streams to address spurious cross-modal credit assignment, yielding 11.2% gains over joint reward baselines and 38.04% improvements over the supervised fine-tuned model.

Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models

Doohyuk Jang*, Yoonjeon Kim*, Chanjae Park, Hyun Ryu, Eunho Yang · *Equal Contribution

Under Review paper code project

Large reasoning models have demonstrated remarkable proficiency in various tasks. However, we observe that they frequently exhibit a problematic reliance on familiar reasoning patterns, a phenomenon we term reasoning rigidity. Despite explicit instructions from users, these models often override clearly stated conditions and default to habitual reasoning, leading to incorrect conclusions. This behavior presents significant challenges not only in reasoning-intensive domains but also in realistic settings. To systematically investigate reasoning rigidity, a behavior unexplored in prior work, we introduce ReasoningTrap, a diagnostic dataset of math problems, puzzles, and agentic tasks. Using this dataset, we identify patterns that occur when models default to ingrained reasoning, and suggest inference-level and GRPO-based post-training remedies. We will publicly release our diagnostic set to facilitate future research on mitigating reasoning rigidity in language models.

Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing

Yoonjeon Kim*, Soohyun Ryu*, Yeonsung Jung, Hyunkoo Lee, Joowon Kim, June Yong Yang, Jaeryong Hwang, Eunho Yang · *Equal Contribution

CVPR 2025 paper code project

The development of vision-language and generative models has significantly advanced text-guided image editing, which seeks the preservation of core elements in the source image while implementing modifications based on the target text. However, existing metrics have a context-blindness problem, indiscriminately applying the same evaluation criteria on completely different pairs of source image and target text, biasing towards either modification or preservation. We propose AugCLIP, a context-aware metric that adaptively coordinates preservation and modification aspects depending on the specific context of a given source image and target text. This is done by deriving the CLIP representation of an ideally edited image using a multi-modal large language model to augment textual descriptions, then calculating a modification vector through a hyperplane that separates source and target attributes in CLIP space. Extensive experiments on five benchmark datasets show that AugCLIP aligns remarkably well with human evaluation standards, outperforming existing metrics.

BibTeX

@InProceedings{Kim_2025_CVPR,
    author    = {Kim, Yoonjeon and Ryu, Soohyun and Jung, Yeonsung and Lee, Hyunkoo and Kim, Joowon and Yang, June Yong and Hwang, Jaeryong and Yang, Eunho},
    title     = {Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {23474-23483}
}

Learning Input-agnostic Manipulation Directions in StyleGAN with Text Guidance

Yoonjeon Kim, Hyunsu Kim, Junho Kim, Yunjey Choi, Eunho Yang

ICLR 2023 paper

With the advantages of fast inference and human-friendly flexible manipulation, image-agnostic style manipulation via text guidance enables new applications that were not previously available. The state-of-the-art text-guided image-agnostic manipulation method embeds the representation of each channel of StyleGAN independently in the CLIP space, and provides it in the form of a Dictionary to quickly find the channel-wise manipulation direction during inference. However, this Dictionary, constructed by controlling single channels individually, is limited in accommodating the versatility of text guidance since the collective and interactive relation among multiple channels is not considered. Indeed, it fails to discover a large portion of the manipulation directions that can be found by existing methods which manually manipulate the latent space without text. To alleviate this, we propose a novel method that learns a Dictionary in which each entry corresponds to the representation of a single channel while taking into account the manipulation effect that comes from its interaction with multiple other channels. We demonstrate that our strategy overcomes the inability of previous methods to find diverse known directions from unsupervised methods and unknown directions from random text, while maintaining real-time inference speed and disentanglement ability.
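
The dictionary lookup described above can be pictured as a matrix-vector product: each row holds one channel's representation in CLIP space, and a text direction is projected onto every row to obtain a per-channel manipulation strength. The sketch below is a simplified illustration under assumptions; the function name, the thresholding step, and the toy values are hypothetical and do not reproduce the paper's learned Dictionary.

```python
def manipulation_direction(dictionary, text_direction, threshold=0.1):
    """Project a CLIP-space text direction onto each channel's dictionary entry.

    dictionary[c] is an assumed CLIP-space vector for style channel c.
    Channels whose score magnitude falls below `threshold` are zeroed out,
    an assumed step to keep the edit restricted to relevant channels.
    """
    scores = [
        sum(d * t for d, t in zip(row, text_direction))
        for row in dictionary
    ]
    return [s if abs(s) >= threshold else 0.0 for s in scores]


# Three channels in a toy 2-D "CLIP space".
dictionary = [[1.0, 0.0], [0.0, 1.0], [0.05, 0.0]]
direction = manipulation_direction(dictionary, text_direction=[1.0, 0.0])
```

Since inference is a single lookup of this kind, it is independent of the input image, which is what keeps the manipulation input-agnostic and fast.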

BibTeX

@inproceedings{
kim2023learning,
title={Learning Input-agnostic Manipulation Directions in Style{GAN} with Text Guidance},
author={Yoonjeon Kim and Hyunsu Kim and Junho Kim and Yunjey Choi and Eunho Yang},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=47B_ctC4pJ}
}

Sequential Targeting: A Continual Learning Approach for Data Imbalance in Text Classification

Joel Jang, Yoonjeon Kim, Kyoungho Choi, Sungho Suh

Expert Systems with Applications 2021 paper

Talks

AI Technology Seminar — KAIST AI Graduate School

Projects

National Research Foundation of Korea

A Study on Optimization and Network Interpretation Method for Large-Scale Machine Learning

Mar 2024 – Feb 2027

Intel Corporation & NAVER

Efficient Foundation Models on Intel Systems

Sep 2024 – Aug 2027

NAVER Cloud

Naver-KAIST Hyper-Creative Center

Sep 2021 – Aug 2023

Experience

Jan 2026 – Jun 2026 · Ongoing

SONY Research

Research Intern · Tokyo, Japan

Mentored by Yuhta Takida and Chieh-Hsin (Jesse) Lai.

Multi-modal Discrete Diffusion Model
Reinforcement Learning on Multi-modal Reasoning

Jul 2023 – Oct 2023

NAVER Cloud

Research Intern

Education

Ph.D.

Korea Advanced Institute of Science and Technology (KAIST)

Graduate School of AI

Mar 2023 – Present

M.S.

Korea Advanced Institute of Science and Technology (KAIST)

Graduate School of AI

Mar 2021 – Feb 2023

B.S.

Yonsei University

Applied Statistics

Mar 2017 – Feb 2021

Academic Service · Conference Reviewer

ICLR, CVPR, NeurIPS