Bibliography

[1] Alexander Jung. Machine Learning: The Basics. Springer, Singapore, 2022.

[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017. arXiv:1706.03762.

[3] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI, 2019.

[4] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI, 2018.

[5] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: multitask learning as question answering. 2018. arXiv:1806.08730.

[6] Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all. 2017. arXiv:1706.05137.

[7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. 2017. arXiv:1703.03400.

[8] Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. Dive into Deep Learning. Cambridge University Press, 2023. URL: https://D2L.ai.

[9] Minhyeok Lee. A mathematical interpretation of autoregressive generative pre-trained transformer and self-supervised learning. Mathematics, 11(11):2451, 2023. URL: https://www.mdpi.com/2227-7390/11/11/2451, doi:10.3390/math11112451.

[10] Phillip Lippe. UvA Deep Learning Tutorials. 2023. URL: https://uvadlc-notebooks.readthedocs.io/en/latest/.

[11] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. 2014. arXiv:1409.0473.

[12] Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. 2014. arXiv:1412.6980.

[13] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. 2017. arXiv:1711.05101.

[14] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. 2019. arXiv:1908.03265.

[15] Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. 2016. arXiv:1608.03983.

[16] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st edition, 2006. ISBN 0387310738.

[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. URL: http://www.deeplearningbook.org.

[18] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 4th edition, 2022.

[19] Runestone Interactive. Problem Solving with Algorithms and Data Structures Using Python. 2023. URL: https://runestone.academy/ns/books/published/pythonds3/index.html.

[20] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning. Cambridge University Press, 2020. URL: https://mml-book.github.io/.

[21] Joseph Muscat. Functional Analysis: An Introduction to Metric Spaces, Hilbert Spaces, and Banach Algebras. Springer, 2014.