Regular Expression Parsing
A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace the matched substrings, or to extract from a string the substrings that satisfy some condition.
Why introduce regular expressions?

To match a whole class of strings that follow the same rules.
Rules

Single characters
- `.` : any character except a newline
- `[]` : a character set, e.g. `[aoe]`, `[a-w]`; matches any one character in the set
- `\d` : a digit, same as `[0-9]`
- `\D` : a non-digit
- `\w` : a digit, letter, underscore, or Chinese character
- `\W` : anything `\w` does not match
- `\s` : any whitespace character
- `\S` : any non-whitespace character
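As a quick illustration of the character classes (a minimal sketch with a made-up sample string):

```python
import re

text = 'py3 regex_demo 2024!'

print(re.findall(r'\d', text))      # each single digit: ['3', '2', '0', '2', '4']
print(re.findall(r'\w+', text))     # runs of word characters: ['py3', 'regex_demo', '2024']
print(re.findall(r'[a-w]+', text))  # runs of letters a-w only ('y' and 'x' fall outside): ['p', 'rege', 'demo']
print(re.findall(r'\s', text))      # the whitespace characters: [' ', ' ']
```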
Quantifiers
- `*` : zero or more times (>= 0)
- `+` : one or more times (>= 1)
- `?` : optional, i.e. 0 or 1 time
- `{m}` : exactly m times
- `{m,}` : at least m times
- `{m,n}` : between m and n times
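The quantifiers above can be demonstrated on a small made-up string:

```python
import re

s = 'a1 b22 c333 d4444'

print(re.findall(r'\d{2}', s))    # exactly two digits at a time: ['22', '33', '44', '44']
print(re.findall(r'\d{2,3}', s))  # two or three digits, greedily: ['22', '333', '444']
print(re.findall(r'\d+', s))      # one or more digits: ['1', '22', '333', '4444']
print(re.findall(r'colou?r', 'color colour'))  # ? makes the u optional: ['color', 'colour']
```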
Boundaries
- `\b` : a word boundary
- `\B` : a non-word boundary
- `^` : matches at the start (the string starts with ...)
- `$` : matches at the end (the string ends with ...)
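A short sketch of the boundary anchors, reusing the `love ...` sample text from later in these notes (with `re.M`, the multi-line flag, `^` and `$` anchor at every line):

```python
import re

text = 'love you very much\nlove she\nshe loves her'

print(re.findall(r'^love', text, re.M))  # lines starting with "love": ['love', 'love']
print(re.findall(r'much$', text, re.M))  # lines ending with "much": ['much']
print(re.findall(r'\blove\b', text))     # whole-word "love" only, so "loves" is skipped: ['love', 'love']
```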
Grouping
- `(ab){3}` , `(){4}` : parentheses make the enclosed pattern a single unit, so a quantifier applies to the whole group
- `()` : a sub-pattern / group pattern
- `\1` , `\2` : backreferences to the text matched by group 1, group 2, ...
```python
import re

string = '<p><div><span>猪八戒</span></div></p>'
pattern = re.compile(r'<(\w+)><(\w+)>\w+</\2></\1>')
ret = pattern.search(string)
print(ret)
```
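`search` returns a match object (or `None` when nothing matches). A sketch of pulling the captured groups out of the match above: `group(0)` is the whole match and `group(n)` is the text captured by the n-th pair of parentheses.

```python
import re

string = '<p><div><span>猪八戒</span></div></p>'
pattern = re.compile(r'<(\w+)><(\w+)>\w+</\2></\1>')
ret = pattern.search(string)
if ret:
    print(ret.group(0))  # the whole match: <div><span>猪八戒</span></div>
    print(ret.group(1))  # group 1, the outer tag name: div
    print(ret.group(2))  # group 2, the inner tag name: span
```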
Greedy and non-greedy matching
- `.*?` , `.+?` : non-greedy versions of `.*` and `.+` (match as little as possible)
- `re.I` : ignore case
- `re.M` : multi-line matching (`^` and `$` match at every line)
- `re.S` : make `.` match newlines as well ("single-line" / dot-all mode)
- `match` / `search` / `findall`
- `re.sub(pattern, replacement, string)` : replace every match of the pattern in the string
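A minimal sketch of `re.sub` and the `re.I` flag, with made-up strings:

```python
import re

# re.sub(pattern, replacement, string) replaces every match
print(re.sub(r'\d+', 'N', 'room 101, floor 3'))  # 'room N, floor N'

# re.I ignores case when matching
print(re.findall(r'python', 'Python python PYTHON', re.I))  # ['Python', 'python', 'PYTHON']
```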
```python
import re

string = '''hate is a beautiful feel
love you very much
love she
love her'''
string1 = """<div>沁园春-雪
    北国风光
    千里冰封
    万里雪飘
    望长城内外
    惟余莽莽
    大河上下
    顿失滔滔
    山舞银蛇
    原驰蜡象
    欲与天公试比高
</div>"""
pattern = re.compile(r'<div>(.*)</div>', re.S)
ret = pattern.findall(string1)
print(ret)
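The greedy `.*` used above works here because `string1` has only one `<div>`. With repeated tags the difference between `.*` and `.*?` shows up clearly (a sketch with a made-up snippet):

```python
import re

html = '<div>one</div><div>two</div>'

print(re.findall(r'<div>(.*)</div>', html))   # greedy, runs to the last </div>: ['one</div><div>two']
print(re.findall(r'<div>(.*?)</div>', html))  # non-greedy, stops at the first </div>: ['one', 'two']
```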
Task
- Crawl the title and content of the specified pages
- Save them to an HTML file, with titles in `h1` tags and content in `p` tags
Example 1
```python
import urllib.request
import urllib.parse
import re
import os
import time


def handle_request(url, page):
    # Build a request for the given page with a browser User-Agent
    url = url + str(page) + '/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request


def download_image(content):
    # Extract every image URL inside a <div class="thumb"> block
    pattern = re.compile(r'<div class="thumb">.*?<img src="(.*?)" .*?>.*?</div>', re.S)
    lt = pattern.findall(content)
    for image_src in lt:
        image_src = 'https:' + image_src
        dirname = 'qiutu'
        if not os.path.exists(dirname):
            os.mkdir(dirname)
        # Use the last path component of the URL as the file name
        filename = image_src.split('/')[-1]
        filepath = dirname + '/' + filename
        print('Downloading image %s ...' % filename)
        urllib.request.urlretrieve(image_src, filepath)
        print('Finished downloading image %s' % filename)
        time.sleep(1)


def main():
    url = 'https://www.qiushibaike.com/pic/page/'
    start_page = int(input('Enter the start page number: '))
    end_page = int(input('Enter the end page number: '))
    for page in range(start_page, end_page + 1):
        print('Starting download of page %s ...' % page)
        request = handle_request(url, page)
        content = urllib.request.urlopen(request).read().decode()
        download_image(content)
        print('Finished download of page %s' % page)
        print()
        time.sleep(2)


if __name__ == '__main__':
    main()
```
Example 2
```python
import urllib.request
import urllib.parse
import re


def handle_request(url, page=None):
    # List pages need a page number appended; detail pages arrive as full URLs
    if page is not None:
        url = url + str(page) + '.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request


def get_text(a_href):
    # Download a detail page and pull out the article body
    request = handle_request(a_href)
    content = urllib.request.urlopen(request).read().decode()
    pattern = re.compile(r'<div class="neirong">(.*?)</div>', re.S)
    lt = pattern.findall(content)
    text = lt[0]
    # Strip any <img> tags from the body
    pat = re.compile(r'<img .*?>')
    text = pat.sub('', text)
    return text


def parse_content(content):
    # Each match captures (relative detail-page link, title)
    pattern = re.compile(r'<h3><a href="(/lizhi/qianming/\d+\.html)">(.*?)</a></h3>')
    lt = pattern.findall(content)
    for href_title in lt:
        a_href = 'http://www.yikexun.cn' + href_title[0]
        title = href_title[-1]
        text = get_text(a_href)
        string = '<h1>%s</h1>%s' % (title, text)
        with open('lizhi.html', 'a', encoding='utf8') as fp:
            fp.write(string)


def main():
    url = 'http://www.yikexun.cn/lizhi/qianming/list_50_'
    start_page = int(input('Enter the start page number: '))
    end_page = int(input('Enter the end page number: '))
    for page in range(start_page, end_page + 1):
        request = handle_request(url, page)
        content = urllib.request.urlopen(request).read().decode()
        parse_content(content)


if __name__ == '__main__':
    main()
```