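Research Archive

An archive of what I have studied and researched through papers, books, and websites.

Each topic aims to provide references, study notes, and reproducible source code together.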

Experimentation

Bao, W. (2023, March 28). How to Size For Online Experiments With Ratio Metrics. Expedia Group Technology. https://medium.com/expedia-group-tech/how-to-size-for-online-experiments-with-ratio-metrics-3d57362f1967
Blocker, C., Conway, J., Demortier, L., Heinrich, J., Junk, T., Lyons, L., & Punzi, G. (2006). Simple Facts about P-Values.
Deng, A., Lu, J., & Litz, J. (2017). Trustworthy Analysis of Online A/B Tests: Pitfalls, challenges and solutions. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 641–649. https://doi.org/10.1145/3018661.3018677
Fabijan, A., Dmitriev, P., Arai, B., Drake, A., Kohlmeier, S., & Kwong, A. (2023). A/B Integrations: 7 Lessons Learned from Enabling A/B testing as a Product Feature. 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 304–314. https://doi.org/10.1109/ICSE-SEIP58684.2023.00033
Gupta, S., Kohavi, R., Tang, D., Xu, Y., Andersen, R., Bakshy, E., Cardin, N., Chandran, S., Chen, N., Coey, D., Curtis, M., Deng, A., Duan, W., Forbes, P., Frasca, B., Guy, T., Imbens, G. W., Saint Jacques, G., Kantawala, P., … Yashkov, I. (2019). Top Challenges from the first Practical Online Controlled Experiments Summit. ACM SIGKDD Explorations Newsletter, 21(1), 20–35. https://doi.org/10.1145/3331651.3331655
Gupta, S., Ulanova, L., Bhardwaj, S., Dmitriev, P., Raff, P., & Fabijan, A. (2018). The Anatomy of a Large-Scale Experimentation Platform. 2018 IEEE International Conference on Software Architecture (ICSA), 1–109. https://doi.org/10.1109/ICSA.2018.00009
Huang, C., & Tang, Y. (2022, May 24). Meet Dash-AB — The Statistics Engine of Experimentation at DoorDash. DoorDash Engineering Blog. https://doordash.engineering/2022/05/24/meet-dash-ab-the-statistics-engine-of-experimentation-at-doordash/
Kohavi, R. (2023, October). Trustworthy A/B Tests: Causality and Pitfalls. Google Docs. https://drive.google.com/file/d/1mbwrCkR52kIfcjkQibI4lRMwOSJJjdv6/view?usp=sharing&usp=embed_facebook
Kohavi, R., Deng, A., Frasca, B., Longbotham, R., Walker, T., & Xu, Y. (2012). Trustworthy online controlled experiments: Five puzzling outcomes explained. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 786–794. https://doi.org/10.1145/2339530.2339653
Kohavi, R., Deng, A., & Vermeer, L. (2022). A/B Testing Intuition Busters: Common Misunderstandings in Online Controlled Experiments. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3168–3177. https://doi.org/10.1145/3534678.3539160
Kohavi, R., & Longbotham, R. (2011). Unexpected results in online controlled experiments. ACM SIGKDD Explorations Newsletter, 12(2), 31–35. https://doi.org/10.1145/1964897.1964905
Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. https://experimentguide.com/
Kohavi, R., Tang, D., Xu, Y., Hemkens, L. G., & Ioannidis, J. P. A. (2020). Online randomized controlled experiments at scale: Lessons and extensions to medicine. Trials, 21(1), 150. https://doi.org/10.1186/s13063-020-4084-y
Machmouchi, W., Gupta, S., Zhang, R., & Fabijan, A. (2020, July 31). Patterns of Trustworthy Experimentation: Pre-Experiment Stage. Microsoft Research. https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/patterns-of-trustworthy-experimentation-pre-experiment-stage/
Microsoft. (2021, January 25). Patterns of Trustworthy Experimentation: During-Experiment Stage. Microsoft Research. https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/patterns-of-trustworthy-experimentation-during-experiment-stage/
Schultzberg, M., Kjellin, O., & Rydberg, J. (2020). Statistical Properties of Exclusive and Non-exclusive Online Randomized Experiments using Bucket Reuse (arXiv:2012.10202). arXiv. https://doi.org/10.48550/arXiv.2012.10202
Schultzberg, M., Kjellin, O., & Rydberg, J. (2021, March 10). Spotify’s New Experimentation Coordination Strategy. Spotify Engineering. https://engineering.atspotify.com/2021/03/spotifys-new-experimentation-coordination-strategy/
Thumbtack, E. (2020, June 1). Seedfinder—Infrastructure to Improve Sample Balance in Online A/B Tests. Thumbtack Engineering. https://medium.com/thumbtack-engineering/seedfinder-infrastructure-to-improve-sample-balance-in-online-a-b-tests-1b8c3ae7dbe8
Xie, H., & Aurisset, J. (2016). Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 645–654. https://doi.org/10.1145/2939672.2939733

Time Series

1 Inferential Modeling · Regression

Spurious regression

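나종화 (Na Jong-hwa). R 응용 시계열분석 (Applied Time Series Analysis with R). 자유아카데미, 2020.
A pitfall to keep in mind whenever one time series is regressed on others
🔗 Study notes
🔗 R tutorial: see the walkthrough on detecting spurious correlation in a CCF analysis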
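
To make the pitfall concrete, here is a minimal simulation sketch in Python rather than the R used in the linked tutorial. It is not taken from the references above. It assumes two independent Gaussian random walks and compares their correlation in levels with the correlation of their first differences. The helper names, the sample sizes, and the seed are all illustrative assumptions.

```python
import random
import statistics


def random_walk(n, rng):
    """Cumulative sum of n independent standard-normal steps."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def diff(x):
    """First differences x[t] - x[t-1]."""
    return [b - a for a, b in zip(x, x[1:])]


def simulate(n_sims=200, n=200, seed=0):
    """Average |correlation| of independent walks, in levels and in differences."""
    rng = random.Random(seed)
    level_r, diff_r = [], []
    for _ in range(n_sims):
        x, y = random_walk(n, rng), random_walk(n, rng)
        level_r.append(abs(pearson(x, y)))
        diff_r.append(abs(pearson(diff(x), diff(y))))
    return statistics.fmean(level_r), statistics.fmean(diff_r)


mean_abs_r_levels, mean_abs_r_diffs = simulate()
print(f"mean |r| in levels:      {mean_abs_r_levels:.2f}")
print(f"mean |r| in differences: {mean_abs_r_diffs:.2f}")
```

Although the two series are unrelated by construction, the correlation in levels is typically far from zero, while the differenced series show almost none. This is why stationarity should be checked before trusting a regression between time series.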

Regression with ARIMA errors

나종화 (Na Jong-hwa). R 응용 시계열분석 (Applied Time Series Analysis with R). 자유아카데미, 2020.
🔗 Study notes
🔗 R tutorial

Distributed lag model

나종화 (Na Jong-hwa). R 응용 시계열분석 (Applied Time Series Analysis with R). 자유아카데미, 2020.
🔗 Study notes

Distributed lag non-linear model

Gasparrini, Antonio, Benedict Armstrong, and M.G. Kenward. “Distributed Lag Non-Linear Models.” Statistics in Medicine 29 (September 20, 2010): 2224–34. https://doi.org/10.1002/sim.3940.
Gasparrini, Antonio. “Distributed Lag Linear and Non-Linear Models in R: The Package Dlnm.” Journal of Statistical Software 43 (July 1, 2011): 1–20. https://doi.org/10.18637/jss.v043.i08.
🔗 Study notes
🔗 Slides (PPT)
🔗 R tutorial

2 Predictive Modeling · Forecasting

Exponential Smoothing

나종화 (Na Jong-hwa). R 응용 시계열분석 (Applied Time Series Analysis with R). 자유아카데미, 2020.
🔗 Study notes
🔗 R tutorial: time series analysis with tidyverse principles

ARIMA model

나종화 (Na Jong-hwa). R 응용 시계열분석 (Applied Time Series Analysis with R). 자유아카데미, 2020.
🔗 Study notes

Prophet

Taylor, Sean, and Benjamin Letham. Forecasting at Scale, 2017. https://doi.org/10.7287/peerj.preprints.3190v2.
🔗 Study notes
🔗 R tutorial

Hierarchical Time Series Forecasting

Athanasopoulos, George, Roman A. Ahmed, and Rob J. Hyndman. “Hierarchical Forecasts for Australian Domestic Tourism.” International Journal of Forecasting 25, no. 1 (January 1, 2009): 146–66. https://doi.org/10.1016/j.ijforecast.2008.07.004.
Athanasopoulos, George, Rob Hyndman, Roman Ahmed, and Han Lin Shang. “Optimal Combination Forecasts for Hierarchical Time Series.” Computational Statistics & Data Analysis 55 (September 1, 2011): 2579–89. https://doi.org/10.1016/j.csda.2011.03.006.
Hyndman, Rob J, George Athanasopoulos, and Han Lin Shang. “Hts: An R Package for Forecasting Hierarchical or Grouped Time Series,” n.d., 12.
🔗 Study notes
🔗 R tutorial

3 Other techniques

Intervention analysis (Interrupted Time Series)

Slides. “Intervention Analysis.” Accessed April 17, 2022. https://slides.com/tonyg/intervention-analysis.
🔗 Reference material
🔗 Study notes
🔗 R code
🔗 R code: arimax() tutorial

Dynamic Time Warping (DTW)

Berndt, Donald J., and James Clifford. “Using Dynamic Time Warping to Find Patterns in Time Series.” In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 359–70. AAAIWS’94. Seattle, WA: AAAI Press, 1994.
A dissimilarity (distance) measure that can match two time series sharing a similar pattern even when one leads or lags the other
DTW distances can be used as the input to hierarchical clustering
🔗 Study notes
🔗 R tutorial
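
As a rough illustration of why DTW tolerates a lead or lag, here is a minimal Python sketch of the textbook dynamic-programming recurrence with an absolute-difference local cost. It is not the code from the linked tutorial. The two example series and the function names are made up for illustration.

```python
def dtw_distance(a, b):
    """DTW distance between two sequences using |x - y| as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def pointwise_distance(a, b):
    """Sum of absolute differences with no time warping, for comparison."""
    return sum(abs(x - y) for x, y in zip(a, b))


leader = [0, 0, 1, 3, 1, 0, 0, 0]
follower = [0, 0, 0, 0, 1, 3, 1, 0]  # the same bump, two steps later

print(dtw_distance(leader, follower))        # 0.0
print(pointwise_distance(leader, follower))  # 8
```

The lagged copy is at distance 0 under DTW but at distance 8 when compared point by point. A matrix of such pairwise DTW distances is what a hierarchical clustering routine would take as input.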

Discrete Wavelet Transform (DWT)

Graps, Amara. “An Introduction to Wavelets.” IEEE Comp. Sci. Engi. 2 (February 1, 1995): 50–61. https://doi.org/10.1109/99.388960.
Li, Daoyuan, Tegawendé F. Bissyandé, Jacques Klein, and Y. L. Traon. “Time Series Classification with Discrete Wavelet Transformed Data: Insights from an Empirical Study.” In SEKE, 2016. https://doi.org/10.18293/SEKE2016-067.
An effective dimensionality-reduction method when time series are laid out as columns of a dataset for classification
In effect, a feature-engineering technique for time series
🔗 Study notes
🔗 R tutorial
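
To sketch how a DWT reduces dimension, the following Python example applies the orthonormal Haar wavelet, the simplest wavelet family. It keeps only the approximation coefficients after a few levels. The choice of Haar, the function names, and the toy series are assumptions for illustration. The cited papers and the linked tutorial may use other wavelets.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    root2 = math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(left + right) / root2 for left, right in pairs]
    detail = [(left - right) / root2 for left, right in pairs]
    return approx, detail


def haar_features(signal, levels):
    """Approximation coefficients left after `levels` Haar steps.

    Each step halves the length, so the result has len(signal) / 2**levels entries.
    """
    approx = list(signal)
    for _ in range(levels):
        if len(approx) % 2:
            raise ValueError("length must be divisible by 2**levels")
        approx, _detail = haar_step(approx)
    return approx


series = [4, 6, 10, 12, 8, 6, 5, 5]
features = haar_features(series, levels=2)
print(len(series), "->", len(features))  # 8 -> 2
print(features)
```

The two remaining coefficients are approximately 16 and 12, which is twice the mean of each half of the series. They summarize its coarse shape and can stand in for the eight raw values as classifier inputs.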

Statistical/Machine Learning

Prerequisite

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
🔗 Study notes: Prerequisite 1 Machine learning terminology

Ensemble methods

Chen, Tianqi, and Carlos Guestrin. “XGBoost: A Scalable Tree Boosting System.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 13, 2016, 785–94. https://doi.org/10.1145/2939672.2939785.
Chen, Lilly. “Basic Ensemble Learning (Random Forest, AdaBoost, Gradient Boosting)- Step by Step Explained.” Medium, January 2, 2019. https://towardsdatascience.com/basic-ensemble-learning-random-forest-adaboost-gradient-boosting-step-by-step-explained-95d49d1e2725.
Morde, Vishal. “XGBoost Algorithm: Long May She Reign!” Medium, April 8, 2019. https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d.
“Light GBM vs XGBOOST: Which Algorithm Takes the Crown.” Accessed April 17, 2022. https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/.
Random Forest, AdaBoost, Gradient Boosting, XGBoost, Light GBM
🔗 Study notes
🔗 R tutorial: machine learning with tidyverse principles

Logistic regression

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. “An Introduction to Statistical Learning.” An Introduction to Statistical Learning. Accessed April 17, 2022. https://www.statlearning.com.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer, 2009. http://www-stat.stanford.edu/~tibs/ElemStatLearn/.
StatQuest with Josh Starmer. Logistic Regression Details Pt 2: Maximum Likelihood, 2018. https://www.youtube.com/watch?v=BfKanl1aSG0.
Chatterjee, Samprit, and Ali S. Hadi. “Regression Analysis by Example, Fifth Edition.”
🔗 Study notes

Generalized Linear Model (GLM) and Generalized Additive Model (GAM)

Hayes, Genevieve. “Beyond Linear Regression: An Introduction to GLMs.” Medium, December 24, 2019. https://towardsdatascience.com/beyond-linear-regression-an-introduction-to-glms-7ae64a8fad9c.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. “An Introduction to Statistical Learning.” An Introduction to Statistical Learning. Accessed April 17, 2022. https://www.statlearning.com.
GLM
- 🔗 Study notes
GAM

Deep Learning

Prerequisites

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
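🔗 Study notes: Prerequisite 1 Motivation and history of deep learning
🔗 Study notes: Prerequisite 2 Introduction to the objects of linear algebra
🔗 Study notes: Prerequisite 3 Matrix transpose and broadcasting
🔗 Study notes: Prerequisite 4 Multiplying matrices and vectors
🔗 Study notes: Prerequisite 5 Linear equations, linear dependence, and span
🔗 Study notes: Prerequisite 6 Norms
🔗 Study notes: Prerequisite 7 Special kinds of matrices and vectors
🔗 Study notes: Prerequisite 8 Eigendecomposition
🔗 Study notes: Prerequisite 9 Singular value decomposition and the generalized inverse
🔗 Study notes: Prerequisite 10 The trace operator and the determinant
🔗 Study notes: Prerequisite 11 Deriving principal components with linear algebra
🔗 Study notes: Prerequisite 12 Machine learning terminology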

High-Dimensional Data Analysis

Breheny, Patrick. High-Dimensional Data Analysis. The University of Iowa, 2016. https://myweb.uiowa.edu/pbreheny/7600/s16/index.html.
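- 🔗 R source code and example datasets provided
Methods for predictive modeling when n is close to p (n → p) or n < p, a setting that standard machine-learning predictive modeling handles poorly (n is the number of observations, p the number of predictors)
These methods are not limited to high-dimensional data; they are also used to improve the predictive performance of regression models in general
The course also covers solutions to the high-dimensional problems that arise in statistical hypothesis testing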
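
A minimal Python sketch of why penalization helps when n < p. With fewer observations than predictors, the matrix X'X has rank at most n, so the least-squares normal equations have no unique solution. Adding a ridge penalty lam * I makes the system solvable. The toy data, the penalty value lam = 1.0, and the helper names are illustrative assumptions. The sketch has no intercept or standardization, and in practice the penalty would be tuned, for example by cross-validation.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge(X, y, lam):
    """Ridge estimate: solve (X'X + lam * I) beta = X'y."""
    p = len(X[0])
    gram = [
        [sum(row[j] * row[k] for row in X) + (lam if j == k else 0.0)
         for k in range(p)]
        for j in range(p)
    ]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve(gram, xty)


# n = 3 observations, p = 5 predictors
X = [
    [1.0, 2.0, 0.0, 1.0, 3.0],
    [0.0, 1.0, 1.0, 2.0, 1.0],
    [2.0, 0.0, 1.0, 1.0, 0.0],
]
y = [3.0, 1.0, 2.0]
beta = ridge(X, y, lam=1.0)
print([round(b, 3) for b in beta])
```

The returned vector has one coefficient per predictor even though only three observations are available. The penalty shrinks the coefficients toward zero to make that possible.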

1 Predictive modeling for high-dimensional data

Prerequisites

Ridge regression

Lasso regression

Bias reduction of Lasso estimator

Variance reduction of Lasso estimator

Penalized logistic regression

Penalized robust regression

2 High-dimensional problems in statistical hypothesis testing

Prerequisites

Family-Wise Error Rates (FWER)

🔗 Study notes

False Discovery Rates (FDR)

🔗 Study notes

Statistics

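An archive of material on statistics and statistical hypothesis testing.

Frequentist and Bayesian views on interpreting interval estimates

🔗 Study notes

Statistical power and the power function

🔗 Study notes

Degrees of Freedom

🔗 Study notes

Standard deviation and standard error

🔗 Study notes

Why claims such as “the alternative hypothesis is true” should be avoided

🔗 Study notes

The meaning of the central limit theorem

Fixed effects and random effects

🔗 Study notes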

Miscellaneous

Modeling infectious disease epidemics with the deterministic SIR model

🔗 Study notes and R tutorial
