🙋 Zhan Shaoxiong (詹少雄)
📧 zhansx24@mails.tsinghua.edu.cn / jasaxion@gmail.com
📍 Shenzhen, China · Tsinghua SIGS
I am an M.S. student at the Knowledge Engineering Lab at Tsinghua Shenzhen International Graduate School, supervised by Prof. Haitao Zheng. My research interests lie in natural language processing, information retrieval, and large language and vision-language models. I received my B.Eng. in Computer Science from Huazhong Agricultural University.
🔥 News
📝 Selected Papers
Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang, Fei Tan
An RL-based framework that forges olympiad-level math problems from concept–explanation pairs, yielding 9.8%–18.1% relative gains on AIME/Olympiad benchmarks.
Shaoxiong Zhan, Hai Lin, Hongming Tan, Xiaodong Cai, Hai-Tao Zheng
Lexical-semantic bridging to enhance fine-grained matching in dense retrieval without modifying backbone encoders.
Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Zijian Lin, Hai Lin, Xiaodong Cai, Shen Li, Hai-Tao Zheng
Identified VLM bottleneck in spatial tasks as lacking view-consistent intermediate representations; proposed "simulate-then-reason" mechanism with orthographic views.
Hongming Tan*, Shaoxiong Zhan*, Hai Lin, Hai-Tao Zheng, Wai Kin Chan
First unified text augmentation framework for dense retrieval via LLM-generated QA pairs and event structures.
Hongming Tan*, Shaoxiong Zhan*, Fengwei Jia, Hai-Tao Zheng, Wai Kin Chan
Hierarchical paper-section-QA decomposition framework for measuring research innovation with confidence-weighted aggregation.
Hai Lin*, Shaoxiong Zhan*, Junyou Su, Hai-Tao Zheng, Hui Wang
Zero-shot retrieval benchmark with 5 tasks and cross-lingual evaluation; introduced SSCI and RCCI metrics.
Shen Li, Li Huang, Shaoxiong Zhan, Weifeng Sun, Tao Yin, Zhongxin Liu, Meng Yan
Difficulty-aware dynamic routing to avoid overthinking in code generation; 46% token reduction with SOTA performance.
🎖 Honors and Awards
💻 Internships
- Building pretraining data pipelines for the code domain: collected a high-quality code dataset for the pretraining and mid-training stages.
- Restructured the GitHub commit data processing pipeline; studying how data distribution impacts code generation capability.
- Proposed a multimodal issue localization benchmark, extending software engineering fault localization to scenarios involving UI screenshots and complex error logs, systematically studying VLM capabilities in code understanding.
- Focused on math reasoning enhancement for LLM foundation models via data-centric approaches.
- Designed an RL-based hard-problem synthesis strategy, producing the MathSmith dataset, which improved performance on public math benchmarks. Published as a first-author paper at AAAI'26.
- Contributed to VLM foundation model iteration for the "photo-solve" product line. Identified VLM spatial reasoning gaps, leading to the first-author paper 3ViewSense.
- Delivered an enterprise-grade RAG backend system built with LangChain, featuring embedding fine-tuning, RAG-fusion, and reranking.
- Implemented OCR and multimodal document processing; contributed to system design docs and client demos that secured partnerships and funding.
🛠 Skills
Tools: Claude Code, Cursor, Codex · Experienced with NAS, soft routers, and self-hosted infra
🎨 Miscellaneous
👋 I'm a hands-on tech enthusiast who enjoys tinkering with gadgets and experimenting with new ideas💡—even if things break along the way. Outside of work, you'll find me playing badminton🏸, swimming🏊‍♀️, or dancing💃 (hip-hop/K-pop). Always happy to connect—feel free to add me on WeChat: Jasaxion_Taurus0405 🤝
