Yuki Uchino

内野佑基

UCHINO Yuki

CONTACT:

yuki.uchino.fe (at) riken.jp

Profile

Name:		Yuki UCHINO (内野佑基)
Email:		yuki.uchino.fe (at) riken.jp
Affiliation:		RIKEN Center for Computational Science
Address:		7-1-26 Minatojima-minami-machi, Chuo-ku, Kobe, Hyogo 650-0047, Japan
Pronoun:		he / him / his
Height:		14 apples
Weight:		230 apples
Memberships:		The Japan Society for Industrial and Applied Mathematics (since July 2022), Information Processing Society of Japan (since April 2025)
Interests:		Mixed-Precision Computing, Numerical Linear Algebra, Self-Validating Methods

Curriculum vitae

1997

October 02, 1997

Born

2016

April 2016

Enrolled in Department of Mathematical Sciences, College of Systems Engineering and Science, Shibaura Institute of Technology

2020

March 2020

Graduated as class representative and top of the class
Received B.S. in Mathematical Sciences

April 2020

Enrolled in Systems Engineering and Science, Graduate School of Engineering and Science, Shibaura Institute of Technology (Master's Program)

2022

March 2022

Completed the Master's Program as class representative and top of the class
Received M.S. in Systems Engineering and Science

April 2022

Enrolled in Functional Control Systems, Graduate School of Engineering and Science, Shibaura Institute of Technology (Doctoral Program)
Appointed as a Research Fellow for Young Scientists (DC1), Japan Society for the Promotion of Science

2024

March 2024

Completed the Doctoral Program (early completion)
Received Ph.D. in Engineering
Resigned from the Research Fellowship for Young Scientists (DC1), Japan Society for the Promotion of Science

April 2024

Appointed as a Postdoctoral Researcher in the Large-Scale Parallel Numerical Computing Technology Research Team, RIKEN Center for Computational Science, RIKEN

2025

April 2025

Appointed as an Editorial Committee Member of IPSJ Transactions on Advanced Computing Systems (ACS)

Recent Works

Aug. 06, 2025 arXiv

High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines

Y. Uchino, K. Ozaki, T. Imamura

@misc{uchino2025highperformancepowerefficientemulationmatrix,
    title={High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines},
    author={Yuki Uchino and Katsuhisa Ozaki and Toshiyuki Imamura},
    year={2025},
    eprint={2508.03984},
    archivePrefix={arXiv},
    primaryClass={cs.DC},
    url={https://arxiv.org/abs/2508.03984},
}

Recent architectures integrate high-performance and power-efficient matrix engines. These engines demonstrate remarkable performance in low-precision matrix multiplication, which is crucial in deep learning. Several techniques have been proposed to emulate single- and double-precision general matrix-matrix multiplication (SGEMM and DGEMM, respectively) by leveraging such low-precision matrix engines. In this study, we present emulation methods that significantly outperform conventional approaches. On a GH200 Grace Hopper Superchip, the proposed DGEMM emulation achieves a 1.4x speedup and a 43% improvement in power efficiency compared to native DGEMM for sufficiently large problems. The proposed SGEMM emulation achieves a 3.0x speedup and a 154% improvement in power efficiency compared to native SGEMM for sufficiently large problems. Furthermore, compared to conventional emulation methods, the proposed emulation achieves more than 2x higher performance and superior power efficiency.

Apr. 10, 2025 arXiv

Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique

K. Ozaki, Y. Uchino, T. Imamura

@misc{ozaki2025ozakischemeiigemmoriented,
    title={{Ozaki Scheme II}: A {GEMM}-oriented emulation of floating-point matrix multiplication using an integer modular technique},
    author={Katsuhisa Ozaki and Yuki Uchino and Toshiyuki Imamura},
    year={2025},
    eprint={2504.08009},
    archivePrefix={arXiv},
    primaryClass={cs.MS},
    doi={10.48550/arXiv.2504.08009},
    url={https://arxiv.org/abs/2504.08009},
}

This paper addresses emulation algorithms for matrix multiplication. General Matrix-Matrix Multiplication (GEMM), a fundamental operation in the Basic Linear Algebra Subprograms (BLAS), is typically optimized for specific hardware architectures. The Ozaki scheme is a well-established GEMM-based emulation method for matrix multiplication, wherein input matrices are decomposed into several low-precision components to ensure that the resulting matrix product is computed exactly through numerical operations. This study proposes a novel GEMM-based emulation method for matrix multiplication that leverages the Chinese Remainder Theorem. The proposed method inherits the computational efficiency of highly optimized GEMM routines and further enables control over the number of matrix multiplications, which can enhance computational accuracy. We present numerical experiments featuring INT8 Tensor Core operations on GPUs and FP64 arithmetic on CPUs as case studies. The results demonstrate that FP64 emulation using the proposed method achieves performance levels of up to 7.4 to 9.8 TFLOPS on the NVIDIA RTX 4090 and 56.6 to 80.2 TFLOPS on the NVIDIA GH200, exceeding the measured performance of native FP64 arithmetic. Furthermore, for FP64 computations on CPUs, the proposed method achieved up to a 2.3x speedup in emulating quadruple-precision arithmetic compared to the conventional Ozaki scheme.
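
As a rough illustration of the integer modular idea in the abstract above, the following Python sketch recovers an exact integer matrix product from its residues modulo several pairwise coprime moduli, using the Chinese Remainder Theorem. It is not the algorithm from the paper: the function names (`matmul_mod`, `crt_matmul`), the example matrices, and the moduli are assumptions made for this example, and the conversion of floating-point inputs to integers and the INT8 GEMM kernels are left out.

```python
from math import prod


def matmul_mod(A, B, p):
    """Multiply integer matrices A and B with all arithmetic reduced modulo p."""
    return [[sum((a % p) * (b % p) for a, b in zip(row, col)) % p
             for col in zip(*B)] for row in A]


def crt_matmul(A, B, moduli):
    """Recover the exact integer product of A and B from its residues.

    Assumes the moduli are pairwise coprime and that every entry c of the
    true product satisfies |c| < prod(moduli) / 2.
    """
    P = prod(moduli)
    residues = [matmul_mod(A, B, p) for p in moduli]
    # CRT weight for modulus p: (P / p) times its inverse modulo p.
    weights = [(P // p) * pow(P // p, -1, p) for p in moduli]
    rows, cols = len(A), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            c = sum(w * R[i][j] for w, R in zip(weights, residues)) % P
            # Map [0, P) to the symmetric range so negative entries come back.
            C[i][j] = c - P if c > P // 2 else c
    return C


# Hypothetical example: three primes below 256, so each residue fits in 8 bits.
MODULI = [251, 241, 239]
A = [[123, -45], [67, 89]]
B = [[-12, 34], [56, -78]]
print(crt_matmul(A, B, MODULI))  # [[-3996, 7692], [4180, -4664]]
```

Each residue product involves only small integers, which is what makes low-precision integer units usable for this kind of computation. Adding more moduli enlarges the range of values that can be reconstructed.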

Mar. 25, 2025 Journal of Advanced Simulation in Science and Engineering

Fast Generation of Real-Symmetric Matrices and their Exact Eigenpairs

Y. Uchino, K. Ozaki, T. Terao, T. Imamura

@article{uchino2025fast,
    title={Fast Generation of Real-Symmetric Matrices and their Exact Eigenpairs},
    author={Yuki Uchino and Katsuhisa Ozaki and Takeshi Terao and Toshiyuki Imamura},
    journal={Journal of Advanced Simulation in Science and Engineering},
    volume={12},
    number={1},
    pages={44-60},
    year={2025},
    doi={10.15748/jasse.12.44}
}

We propose a method to rapidly generate matrices for real-symmetric eigenproblems. The proposed method produces a reproducible matrix with explicit eigenpairs, where the distribution of the eigenvalues can be controlled by a user. All elements of the generated matrix are rigorous floating-point numbers and can be represented in simple expressions involving the exact eigenvalues. The exact eigenpairs of the generated matrix are known in advance; thus, the proposed method contributes to the validation of errors in approximate eigenpairs. Several constraints on the matrix generated by the proposed method were derived theoretically.
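
To make the notion of a test matrix with exactly known eigenpairs concrete, here is a small Python sketch. It is only one possible construction and is not claimed to be the method of the paper: it assumes a Householder matrix built from the all-ones vector, integer eigenvalues, and a matrix size that is a power of two, and the function names are hypothetical. Rational arithmetic from the standard library is used so that every identity can be checked exactly.

```python
from fractions import Fraction


def householder_ones(n):
    """H = I - (2/n) * ones(n, n), which is symmetric and satisfies H @ H = I."""
    return [[Fraction(int(i == j)) - Fraction(2, n) for j in range(n)]
            for i in range(n)]


def symmetric_with_eigenpairs(eigenvalues):
    """Return (A, H) with A = H @ diag(eigenvalues) @ H.

    Column j of H is then an eigenvector of A for eigenvalues[j].
    """
    n = len(eigenvalues)
    H = householder_ones(n)
    A = [[sum(H[i][k] * eigenvalues[k] * H[k][j] for k in range(n))
          for j in range(n)] for i in range(n)]
    return A, H


# Hypothetical example: n = 4, so every entry of A is a multiple of 1/4
# and is therefore exactly representable as a floating-point number.
EIGENVALUES = [1, 2, 3, 4]
A, H = symmetric_with_eigenpairs(EIGENVALUES)
A_float = [[float(a) for a in row] for row in A]
```

Because the eigenpairs are known by construction, the error of an approximate eigensolver applied to `A_float` can be measured against them directly.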

Jan. 09, 2025 The International Journal of High Performance Computing Applications

Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit

Y. Uchino, K. Ozaki, T. Imamura

@article{uchino2025performance,
    author = {Yuki Uchino and Katsuhisa Ozaki and Toshiyuki Imamura},
    title = {Performance enhancement of the {Ozaki} Scheme on integer matrix multiplication unit},
    journal = {The International Journal of High Performance Computing Applications},
    volume = {39},
    number = {3},
    pages = {462-476},
    year = {2025},
    doi = {10.1177/10943420241313064},
}

This study was aimed at simultaneously achieving sufficient accuracy and high performance for general matrix multiplications. Recent architectures, such as NVIDIA GPUs, feature high-performance units designed for low-precision matrix multiplications in machine learning models, and next-generation architectures are expected to follow the same design principle. The key to achieving superior performance is to fully leverage such architectures. The Ozaki scheme, a highly accurate matrix multiplication algorithm using error-free transformations, enables higher-precision matrix multiplication to be performed through multiple lower-precision matrix multiplications and higher-precision matrix additions. Ootomo et al. implemented the Ozaki scheme on high-performance matrix multiplication units with the aim of achieving both sufficient accuracy and high performance. This paper proposes alternative approaches to improving performance by reducing the numbers of lower-precision matrix multiplications and higher-precision matrix additions. Numerical experiments demonstrate the accuracy of the results, and performance benchmarks of the proposed approaches are reported. These approaches are expected to yield more efficient results in next-generation architectures.
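
The splitting idea behind the Ozaki scheme can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation from the paper: the slices here are obtained by truncation in double precision, the helper names (`split_rows`, `sliced_products`) are hypothetical, and the slice width `beta` is assumed to satisfy n * 2**(2 * beta) <= 2**53 for inner dimension n, so that each slice product incurs no rounding error.

```python
from math import frexp, ldexp, trunc


def split_rows(A, beta, num_slices):
    """Split each row of A into `num_slices` slices plus a remainder.

    Each slice entry is an integer below 2**beta in magnitude times a power
    of two shared by its row, and A == sum(slices) + remainder exactly.
    """
    rest = [row[:] for row in A]
    slices = []
    for _ in range(num_slices):
        piece = []
        for row in rest:
            e = frexp(max(abs(x) for x in row))[1]  # row maximum is below 2**e
            # Scaling by powers of two and truncating are both exact.
            piece.append([ldexp(float(trunc(ldexp(x, beta - e))), e - beta)
                          for x in row])
        # Subtracting the leading bits of each entry is also exact.
        rest = [[x - s for x, s in zip(r, p)] for r, p in zip(rest, piece)]
        slices.append(piece)
    return slices, rest


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    """Plain floating-point matrix product, standing in for a fast GEMM."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def sliced_products(A, B, beta, num_slices):
    """Return all slice products A_i @ B_j (rows of A and columns of B are split)."""
    A_slices, _ = split_rows(A, beta, num_slices)
    Bt_slices, _ = split_rows(transpose(B), beta, num_slices)
    return [matmul(Ai, transpose(Btj)) for Ai in A_slices for Btj in Bt_slices]


# Hypothetical example with inner dimension n = 2, so beta = 26 is admissible.
A = [[1.0 / 3.0, 2.5], [-7.25, 0.1]]
B = [[3.7, -1.0], [0.2, 8.0]]
products = sliced_products(A, B, beta=26, num_slices=3)
```

When the remainders are zero, the exact product is the sum of all slice products, and that final accumulation is where higher-precision additions are needed. Reducing the number of slice products and additions is the kind of saving the abstract refers to.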

See here for the full list of publications