celery with redis: redis.exceptions.InvalidResponse Protocol Error

June 21, 2020 09:29


Tags: Redis, Celery
Comments (0) · Views (106)

eventlet on Ubuntu raises OSError: protocol not found

June 21, 2020 09:24

Description

 
The TensorFlow NVIDIA docker image is based on Ubuntu 16.04. That Ubuntu is stripped down and some packages may be missing, so running eventlet on it fails with the error below.
 
 

Error

 
Traceback (most recent call last):
  File "/app/defect-client/defect_client/cmd/wafer-worker.py", line 14, in <module>
    import eventlet
  File "/usr/local/lib/python3.6/dist-packages/eventlet/__init__.py", line 10, in <module>
    from eventlet import convenience
  File "/usr/local/lib/python3.6/dist-packages/eventlet/convenience.py", line 7, in <module>
    from eventlet.green import socket
  File "/usr/local/lib/python3.6/dist-packages/eventlet/green/socket.py", line 21, in <module>
    from eventlet.support import greendns
  File "/usr/local/lib/python3.6/dist-packages/eventlet/support/greendns.py", line 69, in <module>
    setattr(dns.rdtypes.IN, pkg, import_patched('dns.rdtypes.IN.' + pkg))
  File "/usr/local/lib/python3.6/dist-packages/eventlet/support/greendns.py", line 59, in import_patched
    return patcher.import_patched(module_name, **modules)
  File "/usr/local/lib/python3.6/dist-packages/eventlet/patcher.py", line 126, in import_patched
    *additional_modules + tuple(kw_additional_modules.items()))
  File "/usr/local/lib/python3.6/dist-packages/eventlet/patcher.py", line 100, in inject
    module = __import__(module_name, {}, {}, module_name.split('.')[:-1])
  File "/usr/local/lib/python3.6/dist-packages/dns/rdtypes/IN/WKS.py", line 25, in <module>
    _proto_tcp = socket.getprotobyname('tcp')
OSError: protocol not found
 

Solution

 
Reinstall netbase, which ships /etc/protocols (the file socket.getprotobyname() reads):

apt-get -o Dpkg::Options::="--force-confmiss" install --reinstall netbase
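
To confirm the fix, an optional check is to repeat the lookup that failed at the bottom of the traceback:

import socket
print(socket.getprotobyname('tcp'))  # prints 6 once /etc/protocols is back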
 

Tags: eventlet, python
Comments (12) · Views (90)

Sharing variables between celery tasks

June 21, 2020 09:19

Problem

 
In many cases we want tasks to share a variable. How can this be done?
 

How celery concurrency works

 
Celery's worker pool can be of type eventlet, gevent, prefork, or thread.
 
eventlet/gevent (coroutines): a single process with a single thread, so global variables are shared between tasks

prefork (multiprocessing): module-level globals are inherited by every forked worker process, so they are loaded at most once per worker

thread (multithreading): global variables are shared between tasks
    

How to verify

 
Simulate heavy concurrency with the ab command against an endpoint that triggers the task; the behaviour is easy to observe (a minimal task sketch follows the command).
 
ab -n 1000 -c 100 -p ./post.txt -T application/json http://xxxx:5000/xxx
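
For reference, a sketch of the kind of task used in such a test (the broker URL and task name are hypothetical): a module-level counter bumped by every task.

from celery import Celery

app = Celery("demo", broker="redis://localhost:6379/0")  # hypothetical broker URL

counter = 0  # module-level global

@app.task
def bump():
    global counter
    counter += 1    # with an eventlet/gevent or thread pool every task handled by this
    return counter  # worker sees the same variable, so the value keeps growing;
                    # with prefork each worker process keeps its own copy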
 

Conclusion

 
1. If celery needs resources such as a database connection or a GPU model, you do not have to worry about them being loaded again and again.
 
2. Caveat: if a global is initialized lazily inside a task, the initialization is slow, and a burst of tasks arrives at the same time, the initialization may run several times; guard it as in the sketch below.
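
A minimal sketch of such a guard, assuming a redis broker and a hypothetical _load_model() helper, using double-checked locking around the lazy initialization:

import threading

from celery import Celery

app = Celery("demo", broker="redis://localhost:6379/0")  # hypothetical broker URL

_model = None                  # shared global that should be created only once
_init_lock = threading.Lock()  # under an eventlet/gevent pool the stdlib is monkey-patched,
                               # so this lock also serializes greenlets

def _load_model():
    """Hypothetical slow initialization, e.g. loading a model onto the GPU."""
    import time
    time.sleep(5)
    return object()

def get_model():
    global _model
    if _model is None:
        with _init_lock:
            if _model is None:        # double-checked: only the first caller pays the cost
                _model = _load_model()
    return _model

@app.task
def predict(x):
    model = get_model()  # every later task reuses the already-initialized object
    return repr((model, x))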
    
 
 

Tags: celery
Comments (0) · Views (93)

Serializing numpy arrays with protobuf

June 15, 2020 15:02

 

Description

 
protobuf cannot handle numpy arrays directly; the array has to be converted to bytes first.
 

numpy to bytes

 
import numpy as np
from io import BytesIO

A = np.array([1, 2, 3, 4, 4,
              2, 3, 4, 5, 3,
              4, 5, 6, 7, 2,
              5, 6, 7, 8, 9,
              6, 7, 8, 9, 0]).reshape(5, 5)

# numpy -> bytes
nda_bytes = BytesIO()
np.save(nda_bytes, A, allow_pickle=False)

# bytes -> numpy
nda_bytes = BytesIO(nda_bytes.getvalue())
B = np.load(nda_bytes, allow_pickle=False)
print(np.array_equal(A, B))
 

Define the protobuf message

 
ndarray.proto
 
syntax = "proto3";

message NDArray {
  bytes ndarray = 1;
}
 

Usage

 
from io import BytesIO

import numpy as np
from ndarray_pb2 import NDArray  # ndarray.proto above compiled to Python (protoc --python_out=. ndarray.proto)


def ndarray_to_proto(nda: np.ndarray) -> NDArray:
    """numpy -> proto"""
    nda_bytes = BytesIO()
    np.save(nda_bytes, nda, allow_pickle=False)
    return NDArray(ndarray=nda_bytes.getvalue())


def proto_to_ndarray(nda_proto: NDArray) -> np.ndarray:
    """proto -> numpy"""
    nda_bytes = BytesIO(nda_proto.ndarray)
    return np.load(nda_bytes, allow_pickle=False)


A = np.array([1, 2, 3, 4, 4,
              2, 3, 4, 5, 3,
              4, 5, 6, 7, 2,
              5, 6, 7, 8, 9,
              6, 7, 8, 9, 0]).reshape(5, 5)

serialized_A = ndarray_to_proto(A)
deserialized_A = proto_to_ndarray(serialized_A)
assert np.array_equal(A, deserialized_A)
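
As a follow-up to the script above (reusing its A, NDArray, ndarray_to_proto and proto_to_ndarray), this is roughly how the message would cross a process boundary, using the standard protobuf SerializeToString/ParseFromString calls:

# sender side: proto message -> raw bytes (what goes into a socket, file, gRPC field, ...)
wire_bytes = ndarray_to_proto(A).SerializeToString()

# receiver side: raw bytes -> proto message -> numpy array
received = NDArray()
received.ParseFromString(wire_bytes)
assert np.array_equal(A, proto_to_ndarray(received))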
 
 
 

Tags: protobuf
Comments (13) · Views (138)

Building a cephfs distributed file system with docker

May 27, 2020 12:32

 

Goal

 
Build a cephfs file system on a single machine out of several disks: specifically 1 mon, 1 mds, 1 mgr, and 3 osds.
 

Notes

 
a. Using a vmware VM makes this convenient
 
b. You will hit plenty of problems during installation that I did not write down; follow the steps below as closely as possible
 

Environment

 
a. A vmware virtual machine running fedora 30
 
b. Three extra virtual disks: /dev/sdb, /dev/sdc, /dev/sdd (at least 3 osds are needed, hence 3 disks)
 
c. Ceph container image: ceph/daemon:latest-luminous
 
 

Steps

 
1. Pull the image
 
docker pull ceph/daemon:latest-luminous
 
2. Attach the disks

Adding disks to a vmware VM is straightforward; just add them and check with fdisk -l.
 
3. Wipe the disks
 
# Format the disks
mkfs.xfs /dev/sdb -f
mkfs.xfs /dev/sdc -f
mkfs.xfs /dev/sdd -f

# If a disk is already xfs, the commands above do not wipe existing data;
# clean it with zap_device instead
docker run -d --net=host --name=osd0 --rm \
  --privileged=true \
  -v /dev/:/dev/ \
  -e OSD_DEVICE=/dev/sde \
  ceph/daemon:latest-luminous zap_device
 
4. Prepare the directories
 
/root/ceph
/root/ceph/etc
/root/ceph/lib
 
 
5. Start the mon (the monitor node is required)
 
docker run -d --net=host --name=mon \
  -v /root/ceph/etc:/etc/ceph \
  -v /root/ceph/lib/:/var/lib/ceph/ \
  -e MON_IP=192.168.10.125 \
  -e CEPH_PUBLIC_NETWORK=192.168.10.0/24 \
  ceph/daemon:latest-luminous mon
 
6. Start the mgr (optional)
 
docker run -d --net=host --name=mgr \
  -v /root/ceph/etc:/etc/ceph \
  -v /root/ceph/lib/:/var/lib/ceph \
  ceph/daemon:latest-luminous mgr
 
7. Start the osds
 
# Change --name and OSD_DEVICE to start three osds
docker run -d --net=host --name=osd0 \
  --privileged=true \
  -v /root/ceph/etc:/etc/ceph \
  -v /root/ceph/lib/:/var/lib/ceph \
  -v /dev/:/dev/ \
  -e OSD_DEVICE=/dev/sdb \
  -e OSD_TYPE=disk \
  ceph/daemon:latest-luminous osd
 
8. Start the mds (required for cephfs)
 
# Start this only after the osds: CEPHFS_CREATE=1 creates the cephfs file system,
# which depends on the number of osds
docker run -d --net=host --name=mds \
  -v /root/ceph/etc:/etc/ceph \
  -v /root/ceph/lib/:/var/lib/ceph/ \
  -e CEPHFS_CREATE=1 \
  ceph/daemon:latest-luminous mds
 
9. Enter the mon container and check ceph status
 
# Enter the container
docker exec -it mon bash

# Check the status
[root@localhost /]# ceph -s
  cluster:
    id:     4d74fd53-84e0-47e6-a06c-5418e4b3b653
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            2 osds down
            34/51 objects misplaced (66.667%)
            Reduced data availability: 4 pgs inactive, 16 pgs stale
            Degraded data redundancy: 16 pgs undersized
            too few PGs per OSD (4 < min 30)

  services:
    mon: 1 daemons, quorum localhost
    mgr: localhost(active)
    mds: cephfs-1/1/1 up {0=localhost=up:creating}
    osd: 5 osds: 2 up, 4 in

  data:
    pools:   2 pools, 16 pgs
    objects: 17 objects, 2.19KiB
    usage:   4.01GiB used, 75.6GiB / 79.6GiB avail
    pgs:     25.000% pgs not active
             34/51 objects misplaced (66.667%)
             12 stale+active+undersized+remapped
             4  stale+undersized+peered
 
10. Tuning: too few PGs per OSD (4 < min 30)
 
The pools' pg_num and pgp_num are too small; increase them.
 
ceph osd pool set cephfs_data pg_num 64
ceph osd pool set cephfs_data pgp_num 64
ceph osd pool set cephfs_metadata pg_num 32
ceph osd pool set cephfs_metadata pgp_num 32
 
11. Tuning: mds: cephfs-1/1/1 up {0=localhost=up:creating}
 
The mds stays in the creating state because the default minimum number of replicas required for I/O is 2; set it to 1.
 
ceph osd pool set cephfs_metadata min_size 1
ceph osd pool set cephfs_data min_size 1
 
12. Check the ceph status again; when the mds shows active, the cephfs is ready
 
mds: cephfs-1/1/1 up {0=localhost=up:active}
 
13. Mount the cephfs directory (plain mount)
 
# Get the key
cat /root/ceph/etc/ceph.client.admin.keyring

# Mount directly
mount -t ceph 192.168.10.125:6789:/ /root/abc -o name=admin,secret=AQAvoctebqeuBRAAp+FoatmQ5CUlSlo8dmvGAg==

# Unmount
umount /root/abc
 
14. Mount the cephfs directory (ceph-fuse)
 
# Install ceph-fuse
yum install ceph-fuse

# Mount (-k gives the key, -c the config file)
ceph-fuse -m 192.168.10.125:6789 /root/abc1 -k /root/ceph/etc/ceph.client.admin.keyring -c /root/ceph/etc/ceph.conf

# Unmount
umount /root/abc1
 
15. Check the result
 
df -h
192.168.10.125:6789:/   18G   0   18G   0%   /root/abc
ceph-fuse               18G   0   18G   0%   /root/abc1
 

 

Tags: Ceph
Comments (18) · Views (121)

TensorFlow resource exhausted: OOM when allocating tensor with shape

May 1, 2020 13:02

Description

 
Training with TensorFlow often hits the error Resource exhausted: OOM when allocating tensor with shape[64,33,33,2048].
 

Error

tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[64,33,33,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[node SecondStageBoxPredictor_1/ResizeBilinear (defined at /app/models/research/object_detection/predictors/heads/mask_head.py:149) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
    [[total_loss/_7771]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[64,33,33,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[node SecondStageBoxPredictor_1/ResizeBilinear (defined at /app/models/research/object_detection/predictors/heads/mask_head.py:149) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node SecondStageBoxPredictor_1/ResizeBilinear:
    SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/Relu (defined at /app/models/research/slim/nets/resnet_v1.py:136)
Input Source operations connected to node SecondStageBoxPredictor_1/ResizeBilinear:
    SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/Relu (defined at /app/models/research/slim/nets/resnet_v1.py:136)
 

Cause

 
TensorFlow found there was not enough GPU memory while allocating a tensor with shape [64,33,33,2048].
 
If the data type were int8, that tensor would need 64*33*33*2048*1 B = 142,737,408 B ≈ 142.7 MB; the error actually reports type float (float32, 4 bytes per element), so it needs roughly four times that, about 571 MB.
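
A quick check of that arithmetic (only the element sizes of int8 and float32 are assumed):

import numpy as np

shape = (64, 33, 33, 2048)
elements = int(np.prod(shape))               # 142,737,408 elements
print(elements * 1 / 1e6, "MB for int8")     # ~142.7 MB
print(elements * 4 / 1e6, "MB for float32")  # ~570.9 MB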
 

Solutions

 
1. Lower the image resolution
2. Set batch_size to 1 (see the config fragment after this list)
3. Use a GPU with more memory
4. Add more GPUs and train in parallel
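
Since the failing node comes from the object_detection code, point 2 would normally be set in the pipeline config; a hedged sketch of the relevant fragment:

train_config: {
  batch_size: 1
}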
 

 

Tags: tensorflow, nvidia
Comments (1) · Views (318)

Tesla T4 pitfall: Unable to load the kernel module 'nvidia.ko'

April 24, 2020 15:35

Description

 
Pitfalls when installing an NVIDIA Tesla T4 GPU: installing the T4 driver on Ubuntu 16.04 hits the error below.
 

Error

 
make[1]: Leaving directory '/usr/src/linux-headers-4.4.0-142-generic'
-> done.
-> Kernel module compilation complete.
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA GPU(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.
 

Solution

 
The T4 did not work in this ordinary server; switching to a blade server solved it.
 

Additional notes

 
  • If an ordinary PC runs Windows 10, plugging in the T4 and installing the driver works fine.
 
  • When installing an NVIDIA 2080 Ti driver, forgetting to plug in the GPU power cable gives the same error.
 
 
 

Tags: tensorflow
Comments (87) · Views (377)

object-detection image cropping fails with Invalid argument: Key: image/object/mask

April 14, 2020 14:09

  • TensorFlow object-detection raises the following error when cropping images:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Key: image/object/mask. Data types don't match. Expected type: float, Actual type: string
    [[{{node ParseSingleExample/ParseSingleExample}}]]
    [[IteratorGetNext]]
    [[BatchMultiClassNonMaxSuppression/map/while/TensorArrayReadV3_5/_7587]]
(1) Invalid argument: Key: image/object/mask. Data types don't match. Expected type: float, Actual type: string
    [[{{node ParseSingleExample/ParseSingleExample}}]]
    [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored.
  • Cause

The pipeline config file is missing the mask_type parameter that specifies the mask type.

  • Fix

train_input_reader: {
  mask_type: PNG_MASKS
}
eval_input_reader: {
  mask_type: PNG_MASKS
}

Tags: tensorflow, object-detection
Comments (6) · Views (189)

Using socketio together with apscheduler

April 28, 2019 10:16

 

Description

 
The flask project uses flask-socketio for websocket communication and also needs flask-apscheduler for scheduled jobs.
 
 

Problem description

 
The service was originally started like this:
 
gunicorn --worker-class eventlet -w 1 zhima_chat:app -b 0.0.0.0:5000 --access-logfile -
 
Later, apscheduler had to be introduced.
 
Run this way, it caused problems. How should socketio and apscheduler be combined?
 
 

Tags: flask, apscheduler, python, socketio
Comments (327) · Views (18419)

Building a docker image for jekyll

April 28, 2019 10:08

Description

 
jekyll depends on ruby, and every reinstall runs into version problems, which is a hassle, so I simply built an image.
 
 

Problems with the official image

 
Docker Hub has an official jekyll image, and running it directly works fine.
 
But if you mount a volume you hit a permission problem:
 
jekyll 3.8.5 | Error: Permission denied @ dir_s_mkdir - /srv/jekyll/_site
 
 

Tags: jekyll, docker
Comments (342) · Views (3758)