Summary of Tools and Frameworks

Languages
- Python
Packages
- numpy
- pandas
- sqlalchemy
- lxml
- html5lib
- BeautifulSoup4
- nltk
Data
- UCI Machine Learning Repository data sets
Data Visualization
- Matplotlib
- Seaborn
- Pandas built-in data visualization
- Plotly and Cufflinks
- Geographical Plotting
Frameworks
- scikit-learn
- TensorFlow
- Hadoop
- Spark
Algorithms
- Linear Regression
- Logistic Regression
- K Nearest Neighbors
- Decision Trees and Random Forests
- Support Vector Machines
- K Means Clustering
- Natural Language Processing
- Neural Nets and Deep Learning
- Naive Bayes
Methods
- Cross Validation and Bias-Variance Trade-off
- Principal Component Analysis
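
To make the cross-validation entry above concrete, here is a minimal k-fold sketch in plain Python. It is an illustration only, not the scikit-learn implementation: the function names (`k_fold_indices`, `cross_validate_accuracy`) and the toy majority-class model are hypothetical, and accuracy is assumed as the metric.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate_accuracy(fit, predict, X, y, k=5):
    """Mean held-out accuracy over k folds.

    fit(X, y) returns a model; predict(model, X) returns a list of labels.
    Both callables are assumptions made for this sketch.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return mean(scores)


# Toy model (hypothetical): always predict the most common training label.
def fit_majority(X, y):
    return max(set(y), key=y.count)


def predict_majority(model, X):
    return [model] * len(X)
```

Every sample lands in exactly one test fold, so each score comes from data the model did not see during fitting. A large gap between training accuracy and this held-out average is one practical sign of high variance.
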
Cloud Services
- AWS EC2
Model Versioning and Deployment
- PMML
- CI: Jenkins
- Docker
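Model Monitoring
- Resource utilization
  - CPU: top
  - Memory: free -m
  - Disk: df and du
  - Network: netstat
- Service availability
  - Are the core processes still running?
  - Is the API endpoint reachable?
  - Does the API return non-error responses?
  - Does the API time out?
- Throughput
  - Transaction volume
  - Request backlog
- Model quality
  - Spot-check predictions on samples
  - Evaluate accuracy
  - Watch for model drift
  - Identify new training instances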
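
The model quality checks above could be sketched roughly as follows. This is a minimal illustration: the function names and the default thresholds (`min_accuracy=0.9`, `max_shift=3.0`) are assumptions chosen for the example, and the drift signal is a simple mean shift on a single numeric feature rather than a full statistical test.

```python
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the known labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_shift_score(reference, current):
    """Distance between the current and reference feature means,
    measured in reference standard deviations."""
    ref_std = stdev(reference)  # needs at least two reference values
    diff = abs(mean(current) - mean(reference))
    if ref_std == 0:
        return 0.0 if diff == 0 else float("inf")
    return diff / ref_std


def check_model_health(y_true, y_pred, reference, current,
                       min_accuracy=0.9, max_shift=3.0):
    """Combine both checks into one report. Thresholds are illustrative."""
    acc = accuracy(y_true, y_pred)
    shift = mean_shift_score(reference, current)
    return {
        "accuracy": acc,
        "shift": shift,
        "accuracy_ok": acc >= min_accuracy,
        "drift_ok": shift <= max_shift,
    }
```

Here `reference` would hold feature values seen at training time and `current` the values from recent production traffic. Inputs that fail the drift check are natural candidates to label and add as new training instances.
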
Environments
- development
- staging (integration testing)
- production
- canary deployment, Blue/green deployment
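
As a rough illustration of canary deployment, the sketch below sends a fixed share of requests to the new model version. It assumes each request carries a string id. The function name `choose_deployment` and the 5% default are hypothetical, and real setups usually do this split in a load balancer or service mesh rather than in application code.

```python
import hashlib


def choose_deployment(request_id, canary_fraction=0.05):
    """Return "canary" for roughly canary_fraction of ids, else "stable".

    Hashing the id keeps routing deterministic, so the same caller
    always reaches the same model version.
    """
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(canary_fraction * 2**64)
    return "canary" if bucket < threshold else "stable"
```

Raising `canary_fraction` step by step widens the rollout. A blue/green deployment corresponds to the two extremes: all traffic moves from one environment to the other in a single switch, which is 0.0 before the cutover and 1.0 after it.
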