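Following the table of contents in Cui's book, the end goal is a distributed crawler that uses a framework to scrape all of Weibo. So I'm working through it step by step: the proxy pool, then the cookies pool, and finally the Scrapy framework.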

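First, let's look at Weibo's four-grid captcha, which looks like the image below. There are 4 × 3 × 2 × 1 = 24 possible captchas in total, one for each order in which the four points can be visited. One approach is to solve it with image processing, but there is a problem.

Take the captcha in the image above, which I'll call the 4->3->2->1 pattern. Doesn't the 1->2->3->4 pattern look very similar to it? It does. The only difference is that the three arrows in the middle point the opposite way; everything else is identical.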

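So an image-processing algorithm would have to be extremely precise to tell them apart. Instead we use a second method: template comparison. There are only 24 captchas, so save all of them in a folder. At login, take a screenshot, crop the captcha at its fixed position, and compare it one by one against the locally saved captchas using a threshold (0.98). This pins down the current captcha exactly. Name each of the 24 local images after its drag order, in the style of 1432.png. After a successful match, take the digits in front of the extension, turn them into a list, and use selenium's ActionChains module to drag through the points, simulating a human solving the captcha.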
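To make the counting and the naming scheme concrete, here is a minimal sketch. The helper names `all_template_names` and `order_from_name` are made up for illustration and are not part of the original code; the only thing taken from the text is the assumption that each template file is named after its drag order, like 1432.png.

```python
from itertools import permutations


def all_template_names():
    """Return one file name per possible drag order of the four points."""
    return [''.join(order) + '.png' for order in permutations('1234')]


def order_from_name(template_name):
    """Turn a file name such as '1432.png' into the drag order [1, 4, 3, 2]."""
    stem = template_name.split('.')[0]
    return [int(digit) for digit in stem]


names = all_template_names()
print(len(names))                   # 24
print(order_from_name('1432.png'))  # [1, 4, 3, 2]
```

The list of names doubles as a checklist while collecting templates: any name without a matching file in the folder is a captcha that has not been captured yet.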

Required libraries

```python
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import ActionChains
from PIL import Image
import time
from io import BytesIO
from os.path import abspath, dirname
from os import listdir
```

Here I use headless Chrome; for day-to-day debugging I use the normal (headed) browser. I tried the headless one at the very end, and it works too.

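Account and password

```python
USERNAME = 'username'
PASSWORD = 'password'
# Folder next to this script that holds the 24 captcha templates
TEMPLATES_FOLDER = dirname(abspath(__file__)) + '/templates/'
```

Replace username with your own account and password with your own password. TEMPLATES_FOLDER is the folder where you save the images of the 24 captchas.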
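Because the matching step later walks through every file in the templates folder, it can help to list only the image files up front so that stray files (macOS's .DS_Store, for instance) never reach the comparison. The sketch below is one way to do that; `list_templates` is a hypothetical helper, not part of the original code, and the temporary folder only stands in for a real templates folder.

```python
import tempfile
from pathlib import Path


def list_templates(folder):
    """Return the .png files in folder, sorted by name, ignoring everything else."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == '.png'
    )


# Demo on a throwaway folder that stands in for the templates folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ('1432.png', '4321.png', '.DS_Store'):
        (Path(tmp) / name).touch()
    found = [path.name for path in list_templates(tmp)]

print(found)  # ['1432.png', '4321.png']
```

Filtering by extension instead of skipping one specific file name means the same code behaves identically on macOS and Windows.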

Create the class and initialize it

```python
class WeiboCookies(object):
    def __init__(self, username=USERNAME, password=PASSWORD):
        self.url = 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=https%3A%2F%2Fm.weibo.cn%2F&sudaref=m.weibo.cn'
        # Use the arguments, so a caller can pass a different account
        self.username = username
        self.password = password
        self.browser = self.init_browser()
        # self.browser = webdriver.Chrome()
        self.wait = WebDriverWait(self.browser, 20)
```

The commented-out line is the headed Chrome. To use it, uncomment that line and comment out the line above it. (The code below initializes the headless browser; if you go headed as just described, you don't need to write it.)

```python
    def init_browser(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.browser = webdriver.Chrome(options=options)
        return self.browser
```

Open the login page, enter the account and password, and click log in

```python
    def open(self):
        self.browser.get(self.url)
        # Wait for the form fields and the login button before typing
        username = self.wait.until(EC.presence_of_element_located((By.ID, 'loginName')))
        password = self.wait.until(EC.presence_of_element_located((By.ID, 'loginPassword')))
        submit = self.wait.until(EC.element_to_be_clickable((By.ID, 'loginAction')))
        username.send_keys(self.username)
        password.send_keys(self.password)
        time.sleep(1)
        submit.click()
```

Wrong password, or direct login

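```python
    def password_error(self):
        try:
            # The page shows this Chinese message ("wrong username or password")
            return WebDriverWait(self.browser, 5).until(
                EC.text_to_be_present_in_element((By.ID, 'errorMsg'), '用户名或密码错误'))
        except TimeoutException:
            return False

    def login_successfully(self):
        try:
            # This element only exists on the page shown after a successful login
            return bool(WebDriverWait(self.browser, 5).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'lite-iconf'))))
        except TimeoutException:
            return False
```

Stepping back: after clicking log in there are three possible states. The first is that login succeeds directly, the second is that the account or password is wrong, and the third is that a captcha appears. On Weibo, clicking log in usually brings up the captcha, and only after the swipe does the page tell you whether the account or password was wrong.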

Handling the captcha

```python
    def get_position(self):
        try:
            img = self.wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'patt-shadow')))
        except TimeoutException:
            print('Captcha did not appear, reloading the login page')
            self.open()
            # Wait again after reloading so that img is defined below
            img = self.wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'patt-shadow')))
        time.sleep(2)
        location = img.location
        size = img.size
        top, bottom, left, right = location['y'], location['y'] + size['height'], location['x'], location['x'] + size['width']
        return (top, bottom, left, right)

    def get_screenshot(self):
        screenshot = self.browser.get_screenshot_as_png()
        screenshot = Image.open(BytesIO(screenshot))
        return screenshot

    def get_image(self, name='captcha.png'):
        top, bottom, left, right = self.get_position()
        screenshot = self.get_screenshot()
        # PIL's crop box is (left, upper, right, lower)
        captcha = screenshot.crop((left, top, right, bottom))
        captcha.save(name)
        return captcha
```

```python
    def is_pixel_equal(self, image1, image2, x, y):
        pixel1 = image1.load()[x, y]
        pixel2 = image2.load()[x, y]
        threshold = 20
        # The pixels count as equal when R, G and B each differ by less than the threshold
        return all(abs(pixel1[i] - pixel2[i]) < threshold for i in range(3))

    def same_image(self, image, template):
        threshold = 0.98
        count = 0
        for x in range(image.width):
            for y in range(image.height):
                if self.is_pixel_equal(image, template, x, y):
                    count += 1
        # Fraction of pixels that match between the two images
        result = float(count) / (image.width * image.height)
        if result > threshold:
            print('Match found')
            return True
        return False
```

```python
    def detect_image(self, image):
        for template_name in listdir(TEMPLATES_FOLDER):
            # Skip the hidden file macOS creates in every folder
            if template_name == '.DS_Store':
                continue
            print('Matching against', template_name)
            template = Image.open(TEMPLATES_FOLDER + template_name)
            if self.same_image(image, template):
                # '1432.png' -> [1, 4, 3, 2]
                numbers = [int(number) for number in template_name.split('.')[0]]
                print('Drag order', numbers)
                return numbers
```

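```python
    def move(self, numbers):
        try:
            circles = self.browser.find_elements(By.CSS_SELECTOR, '.patt-wrap .patt-circ')
            dx = dy = 0
            for index in range(4):
                circle = circles[numbers[index] - 1]
                if index == 0:
                    # Press and hold on the center of the first circle
                    ActionChains(self.browser).move_to_element(circle).click_and_hold().perform()
                else:
                    # Cover the distance to this circle in small steps
                    times = 30
                    for i in range(times):
                        print(dx, dy)
                        ActionChains(self.browser).move_by_offset(dx / times, dy / times).perform()
                        time.sleep(1 / times)
                if index == 3:
                    # Last circle reached: release the mouse button
                    ActionChains(self.browser).release().perform()
                else:
                    # Offset from the current circle to the next one
                    dx = circles[numbers[index + 1] - 1].location['x'] - circle.location['x']
                    dy = circles[numbers[index + 1] - 1].location['y'] - circle.location['y']
        except Exception:
            print('Swipe failed, starting over')
            self.main()
```

In the code above, get_position() finds the position of the captcha, which is used to crop the target image out of the screenshot. get_screenshot() takes the screenshot. get_image() produces the four-grid captcha image. is_pixel_equal() decides whether a given pixel is the same in two images; looping over all the pixels of both images gives the number of matching pixels, which in turn decides whether the images are the same. detect_image() compares the captcha captured at login against all 24 locally stored captchas and finds the matching one. move() drags through the captcha with selenium once the order is known.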
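move() covers each leg of the drag in 30 small moves of dx / times and dy / times. If the driver ends up working with whole-pixel offsets (an assumption; this varies between Selenium versions), rounding each step separately can leave the pointer a few pixels short of the next circle. A minimal sketch of one way around that is to round the cumulative position instead, so the steps always add up to the full offset. `split_offset` is a hypothetical helper, not part of the original code.

```python
def split_offset(dx, dy, steps):
    """Split a move of (dx, dy) pixels into `steps` integer moves that sum to (dx, dy)."""
    moves = []
    prev_x = prev_y = 0
    for i in range(1, steps + 1):
        # Where the pointer should be after i steps, rounded to whole pixels
        target_x = round(dx * i / steps)
        target_y = round(dy * i / steps)
        moves.append((target_x - prev_x, target_y - prev_y))
        prev_x, prev_y = target_x, target_y
    return moves


moves = split_offset(100, -45, 30)
print(len(moves))                # 30
print(sum(x for x, _ in moves))  # 100
print(sum(y for _, y in moves))  # -45
```

Each (x, y) pair could then be passed to move_by_offset in place of the fractional values.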

Getting the cookies

```python
    def get_cookies(self):
        return self.browser.get_cookies()
```

Fetch the cookies once login has succeeded.
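Selenium's get_cookies() returns a list of dictionaries, one per cookie, each with at least a 'name' and a 'value' key. For a cookies pool it is often handier to store a plain name-to-value mapping. The sketch below shows that conversion; `cookies_to_dict` is a hypothetical helper and the sample cookies are hand-written placeholders, not real Weibo cookies.

```python
def cookies_to_dict(cookie_list):
    """Collapse Selenium-style cookie records into a {name: value} mapping."""
    return {cookie['name']: cookie['value'] for cookie in cookie_list}


# Hand-written sample in the shape get_cookies() returns; names and values are placeholders
sample = [
    {'name': 'session', 'value': 'abc', 'domain': '.example.com', 'path': '/'},
    {'name': 'token', 'value': '123', 'domain': '.example.com', 'path': '/'},
]
cookie_dict = cookies_to_dict(sample)
print(cookie_dict)  # {'session': 'abc', 'token': '123'}
```

A dictionary in this form can be passed directly as the cookies argument of requests.get().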

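Main program flow

```python
    def main(self):
        self.open()
        if self.password_error():
            return {
                'status': 2,
                'content': 'Wrong username or password'
            }
        # Sometimes login succeeds without any captcha
        if self.login_successfully():
            cookies = self.get_cookies()
            return {
                'status': 1,
                'content': cookies
            }
        # Otherwise solve the captcha, then check again
        image = self.get_image('captcha.png')
        numbers = self.detect_image(image)
        self.move(numbers)
        if self.login_successfully():
            cookies = self.get_cookies()
            return {
                'status': 1,
                'content': cookies
            }
        else:
            return {
                'status': 3,
                'content': 'Login failed'
            }
```

There are three statuses: 2 means the password is wrong, 1 means login succeeded, and 3 means login failed, that is, the captcha swipe was wrong.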
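Because the result always carries one of these three status codes, calling code can decide what to do without looking at the browser. As a minimal sketch, a caller might retry only on status 3 (a failed swipe) and stop right away on 1 or 2. `login_with_retry` is a hypothetical helper, and the fake login function below only stands in for a real main() call.

```python
def login_with_retry(login, max_attempts=3):
    """Call login() until it returns a status other than 3, at most max_attempts times."""
    result = None
    for _ in range(max_attempts):
        result = login()
        if result['status'] != 3:
            break
    return result


# Stand-in for WeiboCookies().main: fails once on the captcha, then succeeds
outcomes = iter([
    {'status': 3, 'content': 'Login failed'},
    {'status': 1, 'content': [{'name': 'session', 'value': 'placeholder'}]},
])
result = login_with_retry(lambda: next(outcomes))
print(result['status'])  # 1
```

With the real class this would be called as login_with_retry(WeiboCookies().main).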

A usage example

```python
if __name__ == '__main__':
    weibocookies = WeiboCookies()
    t = weibocookies.main()
    print(t)
```

Summary

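On a Mac, a threshold of 0.99 fails to match anything, while 0.98 just barely matches. Setting it too low doesn't work either.

On a Mac, the folder that stores the 24 local captchas contains a .DS_Store file, which has to be skipped or the matching fails. On Windows this isn't necessary.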
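To see why the threshold is so sensitive, it helps to look at the ratio that same_image() compares against it. The sketch below computes the same kind of ratio on two tiny hand-made pixel lists rather than real screenshots; `match_ratio` is a hypothetical stand-alone version of the logic, and the pixel values are invented purely for illustration.

```python
def match_ratio(pixels_a, pixels_b, tolerance=20):
    """Fraction of pixel pairs whose R, G and B values each differ by less than tolerance."""
    pairs = list(zip(pixels_a, pixels_b))
    matches = sum(
        all(abs(a[c] - b[c]) < tolerance for c in range(3))
        for a, b in pairs
    )
    return matches / len(pairs)


# Two tiny hand-made "images" of 100 RGB pixels each; 3 pixels differ clearly
template = [(200, 200, 200)] * 100
captured = [(190, 205, 198)] * 97 + [(0, 0, 0)] * 3

ratio = match_ratio(captured, template)
print(ratio)         # 0.97
print(ratio > 0.98)  # False
print(ratio > 0.96)  # True
```

A handful of differing pixels is enough to push the ratio under a strict threshold, and a threshold that is too loose would let two captchas that differ only in their arrows pass as the same image.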