Regular Expression Parsing
A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract the substrings that satisfy some condition.
Why introduce regular expressions?
To match a whole class of strings that share the same rules.
Rules
Single characters
. : any character except newline
[] : e.g. [aoe], [a-w] — matches any single character from the set
\d : digit, same as [0-9]
\D : non-digit
\w : digit, letter, underscore, or Chinese character
\W : anything not matched by \w
\s : any whitespace character
\S : non-whitespace
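A quick sketch of the single-character rules above, using `re.findall` on some made-up sample strings:

```python
import re

print(re.findall(r'.', 'a\nb'))        # . skips the newline -> ['a', 'b']
print(re.findall(r'[aoe]', 'hello'))   # any character from the set -> ['e', 'o']
print(re.findall(r'\d', 'py3.8'))      # digits only -> ['3', '8']
print(re.findall(r'\w', 'a_1 中'))     # letters, digits, underscore, Chinese -> ['a', '_', '1', '中']
print(re.findall(r'\s', 'a b\tc'))     # whitespace characters -> [' ', '\t']
```

Note that in Python 3, `\w` follows Unicode word rules by default, which is why it also matches Chinese characters.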
 
Quantifiers
* : any number of times (>= 0)
+ : at least once (>= 1)
? : optional (0 or 1 times)
{m} : exactly m times
{m,} : at least m times
{m,n} : m to n times
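Each quantifier can be checked against a small sample string (the strings here are my own):

```python
import re

print(re.findall(r'ab*', 'a ab abb'))       # * : zero or more b's -> ['a', 'ab', 'abb']
print(re.findall(r'ab+', 'a ab abb'))       # + : one or more b's -> ['ab', 'abb']
print(re.findall(r'ab?', 'a ab abb'))       # ? : zero or one b -> ['a', 'ab', 'ab']
print(re.findall(r'b{2}', 'b bb bbb'))      # {m} : exactly 2 b's -> ['bb', 'bb']
print(re.findall(r'b{2,3}', 'b bb bbbb'))   # {m,n} : greedy, takes 3 from 'bbbb' -> ['bb', 'bbb']
```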
 
Boundaries
\b : word boundary
\B : non-word boundary
$ : ends with ...
^ : starts with ...
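The boundary anchors in action, on sample strings of my own:

```python
import re

print(re.findall(r'\bcat\b', 'cat catalog'))  # whole word only -> ['cat']
print(re.findall(r'ing$', 'matching'))        # ends with 'ing' -> ['ing']
print(re.findall(r'^match', 'matching'))      # starts with 'match' -> ['match']
print(re.search(r'\Bcat', 'tomcat'))          # \B : 'cat' not at a word boundary (matches inside 'tomcat')
```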
 
Grouping
(ab){3}, (){4} : the parentheses are treated as a single unit
() : subpattern / group pattern; \1, \2 are backreferences to group 1, group 2

```python
import re

string = '<p><div><span>猪八戒</span></div></p>'
# \1 and \2 require the closing tags to match the opening tags captured by the groups
pattern = re.compile(r'<(\w+)><(\w+)>\w+</\2></\1>')
ret = pattern.search(string)
print(ret)
```
 
Greedy vs. non-greedy matching
.*? / .+? : non-greedy (lazy) versions of .* and .+
re.I : ignore case
re.M : multi-line mode
re.S : lets . also match newlines ("single-line" / DOTALL mode)
match / search / findall : the main matching functions
re.sub(pattern, replacement, string) : replace every match

```python
import re

string = '''hate is a beautiful feel
love you very much
love she
love her'''

string1 = """<div>沁园春-雪
北国风光
千里冰封
万里雪飘
望长城内外
惟余莽莽
大河上下
顿失滔滔
山舞银蛇
原驰蜡象
欲与天公试比高
</div>"""

# re.S lets .* run across the newlines inside the <div>
pattern = re.compile(r'<div>(.*)</div>', re.S)
ret = pattern.findall(string1)
print(ret)
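A small sketch of greedy vs. non-greedy matching, plus `re.sub` and `re.I`, on a made-up HTML snippet:

```python
import re

html = '<b>one</b><b>two</b>'
print(re.findall(r'<b>.*</b>', html))   # greedy: one big match -> ['<b>one</b><b>two</b>']
print(re.findall(r'<b>.*?</b>', html))  # non-greedy: each tag separately -> ['<b>one</b>', '<b>two</b>']
print(re.sub(r'</?b>', '', html))       # strip the tags -> 'onetwo'
print(re.findall(r'LOVE', 'love Love', re.I))  # ignore case -> ['love', 'Love']
```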
 
Task
Crawl the title and body text of the specified pages.
Save them to an HTML file, with the title in an h1 tag and the body in p tags.
 
 
Example 1

```python
import urllib.request
import urllib.parse
import re
import os
import time


def handle_request(url, page):
    # list pages are addressed as <base>/<page>/
    url = url + str(page) + '/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request


def download_image(content):
    # re.S lets .*? span the newlines inside each thumbnail div
    pattern = re.compile(r'<div class="thumb">.*?<img src="(.*?)" .*?>.*?</div>', re.S)
    lt = pattern.findall(content)
    for image_src in lt:
        # the page uses protocol-relative URLs, so prepend the scheme
        image_src = 'https:' + image_src
        # create the output directory if it does not exist yet
        dirname = 'qiutu'
        if not os.path.exists(dirname):
            os.mkdir(dirname)
        # use the last path segment as the file name
        filename = image_src.split('/')[-1]
        filepath = dirname + '/' + filename
        print('downloading %s ...' % filename)
        urllib.request.urlretrieve(image_src, filepath)
        print('finished downloading %s' % filename)
        time.sleep(1)


def main():
    url = 'https://www.qiushibaike.com/pic/page/'
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))
    for page in range(start_page, end_page + 1):
        print('downloading page %s ...' % page)
        # build the request for this list page
        request = handle_request(url, page)
        # fetch and decode the page content
        content = urllib.request.urlopen(request).read().decode()
        # extract and download every image on the page
        download_image(content)
        print('finished page %s' % page)
        print()
        time.sleep(2)


if __name__ == '__main__':
    main()
```
Example 2

```python
import urllib.request
import urllib.parse
import re


def handle_request(url, page=None):
    # detail pages pass page=None; list pages append the page number
    if page is not None:
        url = url + str(page) + '.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request


def get_text(a_href):
    # fetch the detail page
    request = handle_request(a_href)
    content = urllib.request.urlopen(request).read().decode()
    # extract the article body
    pattern = re.compile(r'<div class="neirong">(.*?)</div>', re.S)
    lt = pattern.findall(content)
    text = lt[0]
    # strip any <img> tags from the body
    pat = re.compile(r'<img .*?>')
    text = pat.sub('', text)
    return text


def parse_content(content):
    # match the link and title of every article on the list page
    pattern = re.compile(r'<h3><a href="(/lizhi/qianming/\d+\.html)">(.*?)</a></h3>')
    lt = pattern.findall(content)
    for href_title in lt:
        # build the absolute URL of the detail page
        a_href = 'http://www.yikexun.cn' + href_title[0]
        title = href_title[-1]
        # download the article body
        text = get_text(a_href)
        # title goes into h1; the body is already HTML with p tags
        string = '<h1>%s</h1>%s' % (title, text)
        with open('lizhi.html', 'a', encoding='utf8') as fp:
            fp.write(string)


def main():
    url = 'http://www.yikexun.cn/lizhi/qianming/list_50_'
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))
    for page in range(start_page, end_page + 1):
        # build the request for this list page
        request = handle_request(url, page)
        # fetch and decode the list page
        content = urllib.request.urlopen(request).read().decode()
        # parse out each article and save it
        parse_content(content)


if __name__ == '__main__':
    main()
```