Environment
Python version | OS | Browser |
---|---|---|
Python 3.7.2 | Windows 7 | Google Chrome |
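About this article
This article implements a simple Baidu Translate client by scraping the site. The code is for learning purposes only and must not be used commercially. For commercial use, purchase the paid API at api.fanyi.baidu.com. If this article infringes on anyone's rights, it will be taken down immediately.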
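Approach
Find the hidden free API among the requests the site makes, send it a request with the parameters it expects, and read the translation out of the returned JSON.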
Baidu Translate's anti-scraping mechanisms
- A sign generated by a JavaScript algorithm
- Cookie checks
- A token
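Finding the hidden free API
Open Baidu Translate and enter any text to translate. Once the translation appears, press F12, open the Network tab, and select the XHR filter. The page's requests have already completed by this point, so press F5 to reload.
After the reload, a request whose name starts with v2transapi? appears in the list. This is the API endpoint we are looking for. To verify, click the request and open its Preview tab: the JSON response contains the translation result.
We also need our cookie and token, which are used later to get past the anti-scraping checks. They can be found in the following places.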
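Cookie location: the cookie field in the request headers of the v2transapi request.
Token location: the token field in the form data of the same request.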
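The cookie copied from the browser is a single string of name=value pairs separated by semicolons. The sketch below splits such a string into a dictionary, which can make it easier to inspect what was copied. The function name parse_cookie_header and the sample cookie names are hypothetical and are not taken from Baidu's site.

```python
def parse_cookie_header(raw):
    """Split a raw cookie header string into a {name: value} dict.

    Hypothetical helper: pairs without an "=" or without a name are skipped.
    """
    cookies = {}
    for part in raw.split(";"):
        # partition() splits on the first "=" only, so values may contain "=".
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Sample value for illustration only; paste your own cookie string instead.
    print(parse_cookie_header("BAIDUID=abc123; lang=x=y"))
```

The resulting dictionary could, for example, be passed to requests through its cookies argument instead of setting the raw header, although the rest of this article sends the raw string as a header.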
API details
Endpoint: https://fanyi.baidu.com/v2tra…
Request method: POST
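Request parameters
Parameter | Description |
---|---|
from | Source language |
to | Target language |
query | Text to translate |
sign | Signature generated by a JavaScript algorithm (anti-scraping) |
token | Request token |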
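As a rough illustration of how the parameters in the table fit together, the sketch below assembles them into the dictionary that is later sent as the POST form data. The function name build_payload, its defaults, and the sample values are assumptions made for this example; only the five field names come from the table.

```python
def build_payload(query, sign, token, source="en", target="zh"):
    """Assemble the form fields listed in the parameter table.

    Hypothetical helper: the defaults (English to Chinese) are an assumption.
    """
    if not query:
        raise ValueError("query must not be empty")
    return {
        "from": source,   # source language code
        "to": target,     # target language code
        "query": query,   # text to translate
        "sign": sign,     # signature computed from the query
        "token": token,   # token copied from the browser
    }


if __name__ == "__main__":
    # Placeholder sign and token values, for illustration only.
    print(build_payload("hello", "123456.654321", "your-token"))
```

Keeping the field names in one place makes it harder to misspell a key, which the server would otherwise treat as a missing parameter.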
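Writing the code
Import the requests and execjs libraries
    import requests
    import execjs
- requests: an HTTP library, used here to send the request
- execjs: used to call JavaScript code (installed as the PyExecJS package; it needs a JavaScript runtime such as Node.js)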
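Getting past the anti-scraping checks
Baidu Translate checks cookies, so we send the cookie we just collected so that the request looks like it comes from a browser.
    headers = {'cookie': 'paste your cookie here'}
We also need to set the token.
    token = 'paste your token here'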
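The last mechanism left is the sign, a signature that a JavaScript algorithm computes from the text to translate. I searched online and found the corresponding algorithm, shared here.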
    var i = "320305.131321201";
    function n(r, o) {
        for (var t = 0; t < o.length - 2; t += 3) {
            var a = o.charAt(t + 2);
            a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
            a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
            r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
        }
        return r
    }
    function e(r) {
        var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
        if (null === o) {
            var t = r.length;
            t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
        } else {
            // Note: a() is not defined in this snippet, so this branch (reached only when
            // the text contains surrogate pairs, e.g. emoji) throws a ReferenceError.
            for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++) "" !== e[C] && f.push.apply(f, a(e[C].split(""))), C !== h - 1 && f.push(o[C]);
            var g = f.length;
            g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
        }
        var u = void 0, l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
        u = null !== i ? i : (i = window[l] || "") || "";
        for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
            var A = r.charCodeAt(v);
            128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ?
                (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)), S[c++] = A >> 18 | 240, S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224, S[c++] = A >> 6 & 63 | 128), S[c++] = 63 & A | 128)
        }
        for (var p = m,
                F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)),
                D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)),
                b = 0; b < S.length; b++) p += S[b], p = n(p, F);
        return p = n(p, D), p ^= s, 0 > p && (p = (2147483647 & p) + 2147483648), p %= 1e6, p.toString() + "." + (p ^ m)
    }
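Wait, aren't we writing this scraper in Python? How do we call JavaScript if we don't write JavaScript?
Fortunately Python has a rich set of third-party libraries, including several for calling JavaScript code. Of these I recommend execjs, which is simple and has everything needed here.
Before calling the JavaScript algorithm, we ask the user for the text to translate.
    q = input('Translate: ')
We can then use execjs's compile and call methods to get the sign.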
    # A raw string keeps the \uXXXX regex escapes intact for the JavaScript engine.
    js = r'''
    var i = "320305.131321201";
    function n(r, o) {
        for (var t = 0; t < o.length - 2; t += 3) {
            var a = o.charAt(t + 2);
            a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
            a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
            r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
        }
        return r
    }
    function e(r) {
        var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
        if (null === o) {
            var t = r.length;
            t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
        } else {
            // Note: a() is not defined in this snippet, so this branch (reached only when
            // the text contains surrogate pairs, e.g. emoji) throws a ReferenceError.
            for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++) "" !== e[C] && f.push.apply(f, a(e[C].split(""))), C !== h - 1 && f.push(o[C]);
            var g = f.length;
            g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
        }
        var u = void 0, l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
        u = null !== i ? i : (i = window[l] || "") || "";
        for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
            var A = r.charCodeAt(v);
            128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ?
                (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)), S[c++] = A >> 18 | 240, S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224, S[c++] = A >> 6 & 63 | 128), S[c++] = 63 & A | 128)
        }
        for (var p = m,
                F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)),
                D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)),
                b = 0; b < S.length; b++) p += S[b], p = n(p, F);
        return p = n(p, D), p ^= s, 0 > p && (p = (2147483647 & p) + 2147483648), p %= 1e6, p.toString() + "." + (p ^ m)
    }
    '''
    # Compile the JavaScript once, then call its function e with the input text.
    sign = execjs.compile(js).call("e", q)
(The code above computes the sign.)
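With the anti-scraping preparation done, we can set the last two parameters: the source language and the target language.
    From = 'en'
    To = 'zh'
(The code above translates English to Chinese. For other languages, use the corresponding language codes; the code table is at the end of this article.)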
Next, we build the request parameters.
    data = {'from': From, 'to': To, 'query': q, 'sign': sign, 'token': token}
Finally, we send the request and print the response.
    url = 'https://fanyi.baidu.com/v2transapi'
    text = requests.post(url, headers=headers, data=data).json()
    print(text)
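The printed result is a JSON dictionary that contains the translation. Since we only need the translation itself, we index into the dictionary and print just that.
    text = requests.post(url, headers=headers, data=data).json()['trans_result']['data'][0]['dst']
    print(text)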
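Indexing straight into the response raises a bare KeyError whenever the server returns something else, for instance when the cookie, token, or sign is rejected. The sketch below wraps the same lookup in a small function with a clearer error message. The name extract_translation and the sample dictionaries are hypothetical; only the trans_result, data, and dst keys come from the code above.

```python
def extract_translation(payload):
    """Return the first translated string from a parsed JSON response.

    Hypothetical helper; raises ValueError if the expected keys are missing.
    """
    try:
        return payload["trans_result"]["data"][0]["dst"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"unexpected response: {payload!r}") from exc


if __name__ == "__main__":
    # Sample dictionary shaped like the path indexed above; values are made up.
    sample = {"trans_result": {"data": [{"dst": "你好", "src": "hello"}]}}
    print(extract_translation(sample))
```

With this in place, a rejected request shows the full response body in the error, which usually points at the parameter that needs refreshing.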
Output:
Complete code:
    import requests
    import execjs

    url = 'https://fanyi.baidu.com/v2transapi'
    headers = {'cookie': 'your cookie'}
    # A raw string keeps the \uXXXX regex escapes intact for the JavaScript engine.
    js = r'''
    var i = "320305.131321201";
    function n(r, o) {
        for (var t = 0; t < o.length - 2; t += 3) {
            var a = o.charAt(t + 2);
            a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
            a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
            r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
        }
        return r
    }
    function e(r) {
        var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
        if (null === o) {
            var t = r.length;
            t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
        } else {
            // Note: a() is not defined in this snippet, so this branch (reached only when
            // the text contains surrogate pairs, e.g. emoji) throws a ReferenceError.
            for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++) "" !== e[C] && f.push.apply(f, a(e[C].split(""))), C !== h - 1 && f.push(o[C]);
            var g = f.length;
            g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
        }
        var u = void 0, l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
        u = null !== i ? i : (i = window[l] || "") || "";
        for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
            var A = r.charCodeAt(v);
            128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ?
                (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)), S[c++] = A >> 18 | 240, S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224, S[c++] = A >> 6 & 63 | 128), S[c++] = 63 & A | 128)
        }
        for (var p = m,
                F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)),
                D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)),
                b = 0; b < S.length; b++) p += S[b], p = n(p, F);
        return p = n(p, D), p ^= s, 0 > p && (p = (2147483647 & p) + 2147483648), p %= 1e6, p.toString() + "." + (p ^ m)
    }
    '''
    From = 'en'  # source language code
    To = 'zh'    # target language code
    token = 'your token'

    q = input('Translate: ')
    # Compute the sign by calling the JavaScript function e with the input text.
    sign = execjs.compile(js).call("e", q)
    data = {'from': From, 'to': To, 'query': q, 'sign': sign, 'token': token}
    # Send the request and pull the translated text out of the JSON response.
    text = requests.post(url, headers=headers, data=data).json()['trans_result']['data'][0]['dst']
    print(text)
Language code table
    {
        'zh': 'Chinese', 'jp': 'Japanese', 'jpka': 'Japanese kana', 'th': 'Thai',
        'fra': 'French', 'en': 'English', 'spa': 'Spanish', 'kor': 'Korean',
        'tr': 'Turkish', 'vie': 'Vietnamese', 'ms': 'Malay', 'de': 'German',
        'ru': 'Russian', 'ir': 'Iranian', 'ara': 'Arabic', 'est': 'Estonian',
        'be': 'Belarusian', 'bul': 'Bulgarian', 'hi': 'Hindi', 'is': 'Icelandic',
        'pl': 'Polish', 'fa': 'Persian', 'dan': 'Danish', 'tl': 'Filipino',
        'fin': 'Finnish', 'nl': 'Dutch', 'ca': 'Catalan', 'cs': 'Czech',
        'hr': 'Croatian', 'lv': 'Latvian', 'lt': 'Lithuanian', 'rom': 'Romanian',
        'af': 'Afrikaans', 'no': 'Norwegian', 'pt_BR': 'Brazilian Portuguese', 'pt': 'Portuguese',
        'swe': 'Swedish', 'sr': 'Serbian', 'eo': 'Esperanto', 'sk': 'Slovak',
        'slo': 'Slovenian', 'sw': 'Swahili', 'uk': 'Ukrainian', 'iw': 'Hebrew',
        'el': 'Greek', 'hu': 'Hungarian', 'hy': 'Armenian', 'it': 'Italian',
        'id': 'Indonesian', 'sq': 'Albanian', 'am': 'Amharic', 'as': 'Assamese',
        'az': 'Azerbaijani', 'eu': 'Basque', 'bn': 'Bengali', 'bs': 'Bosnian',
        'gl': 'Galician', 'ka': 'Georgian', 'gu': 'Gujarati', 'ha': 'Hausa',
        'ig': 'Igbo', 'iu': 'Inuktitut', 'ga': 'Irish', 'zu': 'Zulu',
        'kn': 'Kannada', 'kk': 'Kazakh', 'ky': 'Kyrgyz', 'lb': 'Luxembourgish',
        'mk': 'Macedonian', 'mt': 'Maltese', 'mi': 'Maori', 'mr': 'Marathi',
        'ne': 'Nepali', 'or': 'Odia', 'pa': 'Punjabi', 'qu': 'Quechua',
        'tn': 'Tswana', 'si': 'Sinhala', 'ta': 'Tamil', 'tt': 'Tatar',
        'te': 'Telugu', 'ur': 'Urdu', 'uz': 'Uzbek', 'cy': 'Welsh',
        'yo': 'Yoruba', 'yue': 'Cantonese', 'wyw': 'Classical Chinese', 'cht': 'Traditional Chinese',
    }
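Because the codes in this table are not always the usual two-letter ones (for example fra and kor), a typo is easy to make. The sketch below checks a language pair against the table before a request is sent. The name check_language_pair is hypothetical, and LANGUAGES holds only a few entries from the table to keep the example short.

```python
# A small subset of the table above; extend it with the remaining entries as needed.
LANGUAGES = {
    "zh": "Chinese",
    "en": "English",
    "jp": "Japanese",
    "kor": "Korean",
    "fra": "French",
}


def check_language_pair(source, target):
    """Return (source name, target name), or raise ValueError for an unknown code."""
    for code in (source, target):
        if code not in LANGUAGES:
            raise ValueError(f"unknown language code: {code!r}")
    return LANGUAGES[source], LANGUAGES[target]


if __name__ == "__main__":
    print(check_language_pair("en", "zh"))
```

Calling this before building the request turns a mistyped code into an immediate local error instead of a confusing server response.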