
Cloud Deep Learning Platform: Architecture and Practice

陈迪豪 / 崔建伟
About Us

陈迪豪: Architect of the Prophet platform, 4Paradigm
崔建伟: Architect of the deep learning platform, Xiaomi
Agenda
❖ Define Cloud Machine Learning
❖ Re-define Cloud Machine Learning
❖ Cloud-ML at 4Paradigm
❖ Cloud-ML at Xiaomi
Define Cloud Machine Learning
! What is Machine Learning

CNN MLP RNN/LSTM RL


Define Cloud Machine Learning
! What is Cloud Machine Learning

Google Cloud: Google Cloud Machine Learning Engine (TensorFlow; Training, Prediction)
Amazon Web Services: Amazon Machine Learning (TensorFlow, MXNet; EC2, SaaS)
Microsoft Azure Cloud: Azure Machine Learning Studio (CNTK; Studio, SaaS)
Define Cloud Machine Learning
! Why Cloud Machine Learning

! Train on a local machine
Example: pip install tensorflow
! No resource isolation
! No resource sharing
! No cluster orchestration
! No auto-scaling
! No automatic failover
Define Cloud Machine Learning
! Architecture of Cloud Machine Learning

Application Layer: TensorFlow / MXNet / …
Machine Learning Layer: Training / Prediction / …
Cloud Platform Layer: Kubernetes / OpenStack / …
Define Cloud Machine Learning
! Architecture of Cloud Machine Learning

Model development
Training jobs
Online serving
Define Cloud Machine Learning
! Architecture of Google-like Cloud Machine Learning

Client -> API Service -> K8S Cluster
Submit train job -> Create train container (TensorFlow, MXNet)
Create model service -> Create model container (TF Serving, RESTful)
Submit prediction job -> Create predict container (Online Req, Offline Req)
Define Cloud Machine Learning
! Architecture of Google-like Cloud Machine Learning

Step 1: Build docker image
Step 2: Implement API service
Step 3: Submit to Kubernetes
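
The three steps can be sketched roughly as follows. This is a minimal illustration and not the speakers' implementation: the function name `build_train_job`, the image name, and the GPU resource key `nvidia.com/gpu` are assumptions. The API service is assumed to translate a training request into a Kubernetes `batch/v1` Job manifest that references the image built in step 1.

```python
import json


def build_train_job(job_name, image, command, gpus=0):
    """Translate a training request into a Kubernetes batch/v1 Job manifest.

    Hypothetical helper: an API service (step 2) could call this and then
    hand the manifest to Kubernetes (step 3).
    """
    container = {
        "name": "trainer",
        "image": image,        # image built in step 1 (name is illustrative)
        "command": command,
    }
    if gpus > 0:
        # Assumed GPU resource name; it depends on the cluster's device plugin.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [container],
                }
            },
        },
    }


manifest = build_train_job(
    "mnist-train",
    "registry.example.com/tf-train:latest",
    ["python", "train.py"],
    gpus=1,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON as well as YAML manifests.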
Agenda
❖ Define Cloud Machine Learning
❖ Re-define Cloud Machine Learning
❖ Cloud-ML at 4Paradigm
❖ Cloud-ML at Xiaomi
Re-define Cloud Machine Learning

! TensorFlow vs Hadoop
! TensorFlow vs Spark
! TensorFlow vs Hive
! TensorFlow vs PowerGraph
! TensorFlow vs Azure ML Studio
! TensorFlow vs H2O / Dataiku / 数加
Re-define Cloud Machine Learning
! We need all of these!

! HDFS: for large data storage
! Hive: for data preprocessing
! Spark: for feature extraction
! Hadoop: for task scheduling
! TensorFlow: for model training
! Kubernetes: for CPU/GPU management
Figure label: “Super-machine-learning-man”


Re-define Cloud Machine Learning
! We want all of these!

! Closed loop from data preprocessing to online services
! Feature extraction without writing code
! Easy definition of the machine learning process
! Flexible and heterogeneous infrastructure
! Automatic failover and scaling
! Easy for domain experts to use
Agenda
❖ Define Cloud Machine Learning
❖ Re-define Cloud Machine Learning
❖ Cloud-ML at 4Paradigm
❖ Cloud-ML at Xiaomi
Cloud-ML at 4Paradigm
! Prophet platform (先知平台)
Cloud-ML at 4Paradigm
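! Prophet platform (先知平台)

! Simplified data import: supports RDBMS and HDFS data sources
! Simplified data splitting: supports splitting by ratio and by rule
! Simplified feature extraction: supports combinations of continuous and discrete features
! Simplified model training: supports an in-house ultra-high-dimensional LR and algorithms from open-source frameworks
! Simplified model evaluation: supports ROC, Logloss, K-S and other evaluation metrics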
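
As an illustration of two of the evaluation metrics named above, here is a small pure-Python sketch of Logloss and the K-S statistic for a binary classifier. It is not the platform's code; the function names and the sample data are made up for the example, and it assumes labels are 0 or 1 with at least one example of each class.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def ks_statistic(labels, probs):
    """Largest gap between true-positive rate and false-positive rate
    over all score thresholds (Kolmogorov-Smirnov statistic)."""
    pairs = sorted(zip(probs, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Consume every example tied at this score before measuring the gap.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        best = max(best, abs(tp / pos - fp / neg))
    return best


# Illustrative data only.
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.8, 0.3]
ll = log_loss(labels, probs)
ks = ks_statistic(labels, probs)
print(f"logloss={ll:.4f} ks={ks:.2f}")
```

Lower Logloss is better, while a K-S value closer to 1 means the scores separate the two classes more cleanly.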
Cloud-ML at 4Paradigm
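! Prophet platform (先知平台)

News recommendation for the No. 1 news app in one country: click-through rate up 34%
Audio recommendation for a top-3 knowledge-sharing app: listen-through rate up 43%
Host recommendation for a top-3 live-show streaming app: viewing time up 21%
Content recommendation for the largest UGC community in China: click-through rate up 93%

Diagram labels: editors' expert rules; machine learning model recommendation; machine learning personalized recommendation; users like it; imperceptible to users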
Cloud-ML at 4Paradigm

prophet.4paradigm.com
Agenda
❖ Define Cloud Machine Learning
❖ Re-define Cloud Machine Learning
❖ Cloud-ML at 4Paradigm
❖ Cloud-ML at Xiaomi
Cloud-ML Architecture

Cloud-ML
PaaS: Dev / Training / Serving
SaaS: Vision / NLP / ASR
Docker + Kubernetes
FDS (Xiaomi File Storage Service)
Xiaomi Fusion Cloud (小米融合云) / Xiaomi Ecosystem Cloud (小米生态云)
Cloud-ML Main Features
Cloud-ML Usage

Deployed on 4 clusters
Used by 150+ developers
20+ internal Xiaomi businesses onboarded
5 Xiaomi ecosystem companies onboarded
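Cloud-ML in Practice: PaaS Improvements
! Dev environment
  ! Provides model development capability
! Implementation:
  ! Provides images of the mainstream computing frameworks, with ssh support
  ! Runs the computing framework as a kubernetes service
! Problems:
  ! Pods may be rescheduled
  ! Data persistence
  ! Exposing ports to the outside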
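Cloud-ML in Practice: PaaS Improvements
! Data persistence for the Dev environment
! Fuse
  ! Supports the main Posix interfaces
! FDS supports Fuse
  ! A user's Bucket can be mounted locally
! Kubernetes supports Fuse
  ! /dev/fuse is mounted when the Dev Pod starts
! Cloud-ML supports Fuse
  ! The FDS Bucket is mounted when a Dev environment is created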
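
A Dev Pod that exposes /dev/fuse to its container might look roughly like the manifest built below. This is a sketch under assumptions, not Cloud-ML's actual spec: the helper name `build_dev_pod`, the `FDS_BUCKET` environment variable, and the use of a hostPath volume plus the `SYS_ADMIN` capability are illustrative choices. The FDS Fuse client command itself is not shown because the slides do not give it.

```python
import json


def build_dev_pod(name, image, bucket):
    """Build a Pod manifest whose container can perform a FUSE mount.

    Hypothetical helper. The container gets the host's /dev/fuse device and
    the SYS_ADMIN capability, which mount operations normally require.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "dev",
                    "image": image,
                    # Assumed convention: an entrypoint script reads this
                    # variable and mounts the bucket with the FDS Fuse client.
                    "env": [{"name": "FDS_BUCKET", "value": bucket}],
                    "securityContext": {
                        "capabilities": {"add": ["SYS_ADMIN"]}
                    },
                    "volumeMounts": [
                        {"name": "fuse-device", "mountPath": "/dev/fuse"}
                    ],
                }
            ],
            "volumes": [
                {"name": "fuse-device", "hostPath": {"path": "/dev/fuse"}}
            ],
        },
    }


dev_pod = build_dev_pod("dev-alice", "registry.example.com/tf-dev:latest", "alice-bucket")
print(json.dumps(dev_pod, indent=2))
```

Because the data lives in the bucket and not in the container filesystem, a rescheduled Pod can remount the same bucket and see the same files.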
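Cloud-ML in Practice: PaaS Improvements
! Opening ports in the Dev environment
! Requirement: open ports in the Dev environment that can be reached from outside
! Solution:
  ! HAProxy performs the forwarding
  ! The forwarding node is started as a service
  ! Firewall rules are configured
Diagram labels: Public Access; Eip: hostport; forwarding node; compute node; Kube-proxy; Dev; Docker Proxy; HAProxy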
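
The forwarding rule on the HAProxy node could be generated along these lines. This is only a sketch: the helper name, the port numbers, and the backend address are invented for illustration, and a real configuration would also need `global` and `defaults` sections with timeouts.

```python
def render_haproxy_tcp_forward(name, public_port, backend_host, backend_port):
    """Render an HAProxy frontend/backend pair that forwards one TCP port.

    Hypothetical helper: public_port is the port opened on the forwarding
    node, backend_host:backend_port is where the Dev environment listens.
    """
    lines = [
        f"frontend {name}_in",
        "    mode tcp",
        f"    bind *:{public_port}",
        f"    default_backend {name}_out",
        "",
        f"backend {name}_out",
        "    mode tcp",
        f"    server dev1 {backend_host}:{backend_port} check",
    ]
    return "\n".join(lines)


config = render_haproxy_tcp_forward("dev_alice", 30022, "10.0.0.12", 22)
print(config)
```

TCP mode is used here so that non-HTTP traffic such as ssh can pass through unchanged.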
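Cloud-ML in Practice: PaaS Improvements
! Service discovery for Serving
! Current state:
  ! NodePort is configured
  ! The control node forwards requests to the service
! Problems:
  ! The control node is a single point of failure
  ! Identifying services by port is unfriendly to business users
! Solution: name service
Diagram labels: client; request; name service; collector; service: pods; service1:pods, service2:pods, …; Kubernetes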
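
The name service idea can be sketched as a small in-memory registry. This is an assumption-laden illustration, not the Xiaomi implementation: the class `NameService`, its methods, and the round-robin policy are invented here, and the collector from the diagram is assumed to push each service's pod endpoints into the registry.

```python
import itertools


class NameService:
    """Map a service name to its pod endpoints and hand them out round-robin."""

    def __init__(self):
        self._endpoints = {}
        self._cursors = {}

    def update(self, service, endpoints):
        """Replace the endpoint list for a service (called by a collector)."""
        self._endpoints[service] = list(endpoints)
        self._cursors[service] = itertools.cycle(self._endpoints[service])

    def resolve(self, service):
        """Return the next endpoint for a service name."""
        if not self._endpoints.get(service):
            raise LookupError(f"no endpoints registered for {service!r}")
        return next(self._cursors[service])


ns = NameService()
# Illustrative addresses; a collector would read these from Kubernetes.
ns.update("service1", ["10.0.0.1:9000", "10.0.0.2:9000"])
picked = [ns.resolve("service1") for _ in range(3)]
print(picked)
```

Clients then address a model service by name and talk to the pods directly, so no single control node sits on the request path and no port number has to be remembered.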
Cloud-ML in Practice: SaaS Services
Image recognition

Natural language processing

Speech recognition
Cloud-ML in Practice: SaaS Services
! Use cases

Diagram labels: images / speech / text; smart devices; App Server; Cloud-ML SaaS; FDS (Xiaomi File Storage Service)
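Cloud-ML in Practice: SaaS Services

Image recognition
Face detection: face position, gender, age
Object recognition: 1500+ object classes (including scenes such as living room and bedroom)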

FaceInfo:
topX: 208
topY: 73
width: 403
height: 403
child
female
age: 5.6
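Cloud-ML in Practice: SaaS Services

Image recognition
Face detection: face position, gender, age
Object recognition: 1500+ object classes (including scenes such as living room and bedroom)

Object          Confidence
Living room     0.52
Dining room     0.14
Hall            0.08
Waiting room    0.06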
Cloud-ML in Practice: SaaS Services
! Image recognition
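Cloud-ML in Practice
! Future work
! PaaS
  ! Support more training frameworks
    ! Kaldi, CNTK
  ! Dev environment state can be saved
  ! Resource overcommitment
  ! Seamless integration with the data processing pipeline
! SaaS
  ! Bring more model services online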
