Data Science
-
Language
- Python
-
Packages
- numpy
- pandas
- sqlalchemy
- lxml
- html5lib
- BeautifulSoup4
- nltk
-
Data
-
Data Visualization
- Matplotlib
- Seaborn
- Pandas built-in data visualization
- Plotly and Cufflinks
- Geographical Plotting
-
Frameworks
- scikit-learn
- TensorFlow
- Hadoop
- Spark
-
Algorithms
- Linear Regression
- Logistic Regression
- K Nearest Neighbors
- Decision Trees and Random Forests
- Support Vector Machines
- K Means Clustering
- Natural Language Processing
- Neural Nets and Deep Learning
- Naive Bayes
-
Methods
- Cross Validation and Bias-Variance Trade-off
- Principal Component Analysis
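
As a rough illustration of the cross-validation item above, the sketch below splits sample indices into k folds and averages a score over them. It uses only the standard library; the names `k_fold_indices` and `cross_validate_mean`, and the `fit`/`score` callables, are hypothetical and not taken from any package in this outline. It assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate_mean(fit, score, X, y, k=5):
    """Train on k-1 folds, score on the held-out fold, return the mean score.

    `fit(X_train, y_train)` returns a model; `score(model, X_test, y_test)`
    returns a number. Both are supplied by the caller (hypothetical interface).
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(
            score(model, [X[i] for i in test_idx], [y[i] for i in test_idx])
        )
    return sum(scores) / len(scores)
```

A large gap between training score and the cross-validated score hints at high variance (overfitting); both scores being poor hints at high bias. In practice a library splitter such as scikit-learn's `KFold` would normally replace the hand-written one.
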
-
Cloud Services
- AWS EC2
-
Model Versioning and Deployment
- PMML
- CI: Jenkins
- Docker
-
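Model Monitoring
- Resource utilization
- CPU: top
- Memory: free -m
- Disk: df and du
- Network: netstat
- Service availability
- Is the core process still running?
- Is the API endpoint reachable?
- Does the API return a non-error response?
- Does the API time out?
- Throughput
- Transaction volume
- Request backlog
- Model quality
- Sample predictions
- Evaluate accuracy
- Watch for model drift
- Identify new training instances
- Resource utilization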
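
One possible way to watch for model drift, sketched under assumptions: keep a rolling window of labelled predictions and flag drift when rolling accuracy falls more than a tolerance below the accuracy measured at deployment time. The class name `AccuracyDriftMonitor`, the window size, and the tolerance are all hypothetical choices, not part of any listed tool.

```python
from collections import deque


class AccuracyDriftMonitor:
    """Track rolling accuracy and compare it with a fixed baseline."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store whether one prediction matched its ground-truth label."""
        self.outcomes.append(predicted == actual)

    def rolling_accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self):
        """True when rolling accuracy is below baseline minus tolerance."""
        acc = self.rolling_accuracy()
        return acc is not None and acc < self.baseline - self.tolerance
```

Samples that trigger the drift flag are natural candidates for the "new training instances" item above, since they are the ones the current model gets wrong.
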
-
Environments
- development
- staging (integration testing)
- production
- canary deployment, Blue/green deployment
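
A minimal sketch of how canary routing might be decided, assuming each request carries a stable identifier: hash the identifier into [0, 1) and send it to the canary when it falls below the configured fraction. The function name `route` and the labels `"canary"` and `"stable"` are hypothetical; real deployments usually do this in a load balancer or service mesh rather than in application code.

```python
import hashlib


def route(request_id, canary_fraction):
    """Return "canary" or "stable" for a request, deterministically."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be within [0, 1]")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "stable"
```

Because the decision depends only on the identifier, the same caller keeps hitting the same version while the fraction is unchanged. A blue/green switch corresponds to moving the fraction from 0.0 to 1.0 in one step instead of raising it gradually.
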