Yao-Chun Chan

Open to remote work | ychan199@ucr.edu

I am experienced in applying and creating machine learning (ML) algorithms and models, and in building efficient systems that put them to work in real-world scenarios. I have also handled a range of business requirements and developed personalized intelligent models and systems, drawing on natural language processing (NLP), data mining, ML, and deep learning to solve critical problems.

My recent focus is efficient and fair ML. I worked with my advisor, Samet Oymak, at the University of California, Riverside, to create algorithms for the next generation of ML that need less labeled data and are more automatic and more robust. I also deepened my knowledge of computer vision (CV). In addition, during my CS degree I concentrated on big data infrastructure and database systems, with the aim of making data systems more efficient and better aligned with machine learning models.


Skill Sets

Machine Learning/Deep Learning & Data Mining

Statistical Approach & Data Analysis/Visualization

CV & NLP

Big Data & Systems

Web Development

Project Management

Relevant Work Experience

National Taiwan University

Machine Learning Engineer

Developed scalable NLP-based ML pipeline libraries in Python serving 10,000+ users, and put models into production:
1. Collaborative Embedding-Based Movie Recommender, Taiwan Mobile
Applied a Neural Collaborative Filtering model built on movie-content embeddings (BERT, Doc2vec) to Taiwan Mobile's video app, which has millions of users, increasing click-through rate by over 20% compared with baseline models in a high-sparsity scenario.

- Implemented the architecture, including an unsupervised-learning loss function, a sampling function, and an embedding model.
The ideas came mainly from "Embedding-based News Recommendation for Millions of Users", which combines content-based signals with user preferences.
- Integrated multi-head attention and a CNN into the user-embedding component of the recommender system using Keras and TensorFlow, improving MRR by one quarter.
- Solved a space-utilization problem by retrieving only the target items' vectors instead of whole matrices.
2. Dangerous Cargo Detection from Text, WAN HAI Lines
Developed a machine learning API, deployed on AWS, that automates the detection of dangerous shipping cargo from multi-format, multilingual text data, saving WAN HAI Lines LTD 100,000 hours daily.

- Designed a Python module that cleans, translates, and tokenizes multilingual documents into a single-language format, based on an analysis of the word distribution across all existing documents.
- Built an n-gram TF-IDF Naive Bayes SVM that reached up to 90% accuracy and 78% recall on imbalanced data for classifying multilingual documents.
- Developed a fault-tolerant ETL/data pipeline with RESTful OCR for document preprocessing, using NLTK and Python scripts.
- Implemented, in Python, a word-search module for documents and a web-query module for chemical materials (using the Beautiful Soup package), providing additional information to verify the confidence level of the TF-IDF Naive Bayes SVM.
- Developed a character-level seq2seq model to help match similar words (synonyms) and tokens that users want to search for.
3. Financial Forecasting from News, CTBC Bank
Developed an embedding-based neural time-series model in Keras and an NLP pipeline over millions of Reuters news articles to help 2,000+ employees at CTBC Bank predict financial indicators.

- Additionally surfaced the important news items retrieved by the model to support causal inference in financial investment decisions.
July 2018 - Aug 2019

Publication

On the Marginal Benefit of Active Learning: Does Self-Supervision Eat Its Cake?

Yao-Chun Chan*, Mingchen Li*, Samet Oymak
IEEE International Conference on Acoustics, Speech and Signal Processing 2021 (ICASSP)

Active learning is the set of techniques for intelligently labeling large unlabeled datasets to reduce the labeling effort. In parallel, recent developments in self-supervised and semi-supervised learning (S4L) provide powerful techniques, based on data-augmentation, contrastive learning, and self-training, that enable superior utilization of unlabeled data which led to a significant reduction in required labeling in the standard machine learning benchmarks. A natural question is whether these paradigms can be unified to obtain superior results. To this aim, this paper provides a novel algorithmic framework integrating self-supervised pretraining, active learning, and consistency-regularized self-training.

Responsibilities:
1. Initiated the idea, research question, and big picture of this paper.
2. Studied state-of-the-art self-supervised, semi-supervised, and active learning algorithms and adapted them in PyTorch to fit the task.
3. Analyzed the algorithms' behavior and drew the main conclusions of this paper.

A Hybrid Approach for Hotel Recommendation

Kung-Hsiang Huang, Yi-Fu Fu, Yi-Ting Lee, Tzong-Hann Lee, Yao-Chun Chan, Yi-Hui Lee, Shou-De Lin
Proceedings of the Workshop on ACM Recommender Systems Challenge 2019

Session-based recommender system refers to a specific type of recommender system that focuses more on the transactional structure of each session rather than the user and item interactions. It is stated that the users' interactions are mostly homogeneous in the same sessions, while being heterogeneous across different sessions. Therefore, it is essential to extract the interest dynamics of users within each session. The 2019 ACM Recsys Challenge aims to apply session-based recommender systems to the domain of travel metasearch. The goal is to predict which hotels are clicked in the search results based on the context of each session. In this paper, we propose our approach to effectively tackle the challenge. It involves an ensemble of three models, LightGBM, XGBoost, and a Neural Network based on DeepFM that is capable of handling sequential features. Our team, RosettaAI, won the 4th place...

Responsibilities:
1. Discovered several useful features.
2. Optimized the deep learning model for sequential (session-based) data and LightGBM for structured data.
3. Studied and explored possible unsupervised-learning approaches, such as extracting information by generating attention weights on the attributes.


Education

University of California - Riverside

Master of Science
Computer Science

GPA: 3.8

Sep 2019 - Jun 2020

National Taiwan University of Science and Technology, National Taiwan Normal University

Certificate in Computer Science, 27 credits in total

GPA: 4.22/4.3. Foundations of computer science: data structures, algorithms, operating systems, computer organization (C, C++, Java)

Feb 2009 - Jun 2013


Awards

  • 4th Place - ACM RecSys Challenge - June 2019
  • Top 1% - Kaggle Competition - Home Credit Default Risk - Aug 2018
    Developed an ML system to predict credit risk (fraud detection) from users' information and historical transactions.
  • Top 5% - Kaggle Competition - Toxic Comment Challenge - Mar 2018
    Used multi-task models with text embeddings at different levels (character, word, n-gram) to identify toxic comments (sentiment analysis) on a social media platform.

Projects

Twitter Search Engine and Topic Modeling, containerized with Docker

- Built several indexing methods with PySpark, PyLucene, and MapReduce (mrjob) for selecting features of tweets.
- Used MongoDB (NoSQL) to store the tweets and connected the data flows with Django and PyMongo.
- Developed a ranking function for tweet search and a topic model (LDA) for clustering tweets, and evaluated both.
- Developed a React interface to show search results and visualize tweet locations.
Winter 2020, Fall 2020

Linux module - Final Project of Advanced Operating Systems Course

- Implemented a hash table and a linked list in C++ to trace memory allocation in the Linux system and prevent errors.
Mar 2020

Sampling on Probabilistic Graphical Models - Final Project of Probability Model for AI Course

- Compared Gibbs sampling and likelihood-weighted sampling across Bayesian networks of different structures and identified the properties of each method.
- Developed an algorithm to automatically generate multiple Bayesian networks under different conditions in a batch.
Mar 2020

Context-Aware Music Recommender System for KKBOX - Final Project of Data Mining Course

- Developed hybrid neural collaborative filtering approaches (Wide & Deep, DeepFM, and a content-embedding-based neural collaborative recommender) that matched the state-of-the-art accuracy in the WSDM Cup 2018 while using fewer features.
- Built a variational autoencoder (VAE) to learn latent-space representations of users' multi-hot encoded attributes and passed them to the collaborative filtering model.
Fall 2019

KDD Cup 2018: PM 2.5 Prediction

- Integrated data from multiple tables efficiently with pandas, using SQL-style operations.
- Built a dynamic map for tracking PM 2.5 by time and location with the Plotly library.
- Developed a graph model for analyzing PM 2.5 on the map.
Nov 2018

To be updated

...
...

Contact Me

Swing by for a cup of coffee, or leave me a message: