About Me

I am a Postdoctoral Research Fellow in the Schmidt AI in Science Fellowship program at the University of Michigan, mentored by Professor Yixin Wang in the Department of Statistics and Professor Bryan R. Goldsmith in the Department of Chemical Engineering.

I completed my Ph.D. in Statistics at New York University in 2023, advised by Professor Halina Frydman, with a dissertation focused on adapting machine-learning models to address challenging problems in the medical and physical sciences. Before that, I completed an M.A. in Statistics at Columbia University in 2017 and a B.A. in Finance and Applied Mathematics at Wuhan University in 2015.

My research interests include probabilistic generative modeling and its applications in materials discovery. See my CV.

As a Schmidt AI in Science Fellow,

  • I am currently developing large foundational generative models for molecular design under data-scarce conditions to advance scientific discovery in chemistry and materials science. My research leverages cutting-edge architectures, including flow matching and transformer-based models, to generate novel chemical structures with desired properties.
  • Recognizing that these generative approaches often require abundant, representative datasets and may struggle when target properties fall outside the observed training distribution, I aim to tackle the challenge of limited data through extrapolative methodology design and strategic experimental design.
  • I am also working on an autonomous LLM-driven agentic AI copilot that manages end-to-end transition-state search workflows, from user problem specification to converged structures. The agent tailors methods and settings to each reaction environment, reducing expert effort and improving robustness in computational catalysis.


Research

Preprints and workshop papers

  • W. Yao, B. Dumitrascu, B. R. Goldsmith and Y. Wang
    Goal-oriented influence-based active data acquisition
    In preparation for submission to Journal of Machine Learning Research, 2025.
  • W. Yao, C. Gruich, B. R. Goldsmith and Y. Wang
    Tail extrapolative conditional molecule generation
    In Proceedings of the International Conference on Machine Learning (ICML) AI4Science Workshop, 2024.  
    link
  • W. Yao, K. Storey-Fisher, D. W. Hogg and S. Villar
    A simple equivariant machine learning method for dynamics based on scalars
    In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) 2021 Workshop on Machine Learning and the Physical Sciences, 2021.  
    arXiv code

Publications

  • S. Villar, D. W. Hogg, W. Yao, G. A. Kevrekidis and B. Schölkopf
    Towards fully covariant machine learning
    Transactions on Machine Learning Research, 2024.  
    arXiv code
  • S. Villar, W. Yao, D. W. Hogg, B. Blum-Smith and B. Dumitrascu
    Dimensionless machine learning: Imposing exact units equivariance
    Journal of Machine Learning Research 24, 2023.  
    arXiv code
  • W. Yao, H. Frydman, D. Larocque and J. S. Simonoff
    Ensemble methods for survival function estimation with time-varying covariates
    Statistical Methods in Medical Research, 31(11):2217-2236, 2022.  
    pdf link code
  • S. Villar, D. W. Hogg, K. Storey-Fisher, W. Yao and B. Blum-Smith
    Scalars are universal: Equivariant machine learning, structured like classical physics.
    Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2021.
    arXiv code
  • H. Moradian, W. Yao, D. Larocque, J. S. Simonoff and H. Frydman
    Dynamic estimation with random forests for discrete-time survival data.
    The Canadian Journal of Statistics, 50(2):533-548, 2021.
    pdf link code
  • W. Yao, H. Frydman and J. S. Simonoff
    An ensemble method for interval-censored time-to-event data.
    Biostatistics, 22(1):198-213, 2021.
    pdf link code
  • W. Yao, A. S. Bandeira and S. Villar
    Experimental performance of graph neural networks on random instances of max-cut.
    Proceedings of SPIE (Society of Photo-Optical Instrumentation Engineers), 2019.
    pdf link code
  • J. H. Lee, D. E. Carlson, H. S. Razaghi, W. Yao, G. A. Goetz, E. Hagen, E. Batty, E. J. Chichilnisky, G. T. Einevoll and L. Paninski.
    YASS: Yet Another Spike Sorter.
    Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2017.
    pdf link code

Experience

I was a Science Intern at the Amazon Development Center in Germany from December 2022 to May 2023, working on transfer learning with deep tabular models.

I was a Research Intern on the Applied Machine Learning Team at TikTok, ByteDance, from May 2022 to August 2022, working on maximum likelihood estimation with a full-likelihood formulation for click-through rate prediction on nonuniformly subsampled data.

I was the Course Instructor for STAT-UB1.001: Statistics for Business Control at NYU Stern in Summer 2020.

I was a teaching fellow at New York University for

  • XBA1-GB.8314: Operations Analytics (Summer 2021, Fall 2020, Spring 2019)
  • STAT-GB.3205: Analytics & Machine Learning for Managers (Spring 2021)
  • STAT-GB.3321: Introduction to Stochastic Processes (Spring 2021)
  • STAT-UB.0103: Statistics for Business Control: Regression & Forecasting Models (Fall 2020, Summer 2020)
  • COR1-GB.1305: Statistics and Data Analysis (Fall 2018)

I was a research assistant with Professor Liam Paninski at the Grossman Center for the Statistics of Mind, Columbia University, from 2016 to 2017, working on projects that develop statistical methodology for understanding how neurons encode information.


Education

Ph.D. in Statistics 2023

New York University

M.A. in Statistics 2017

Columbia University

B.A. in Finance & B.A. in Applied Mathematics 2011-2015

Economics and Management School & Mathematics and Statistics School, Wuhan University