Python--Download image file from the HTML page source

Posted by: Max Chen | in Python | 1 year, 11 months ago |

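Here is the code directly; I use it to back up the images that my posts load from the Medium CDN.

It parses each exported page with bs4 (BeautifulSoup) and then downloads every image with requests.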

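In Python 3, urlretrieve lives in urllib.request and is documented as a legacy interface. It can also fail with HTTP 403 Forbidden on some servers (see the references), so this script downloads the files with requests instead.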
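
For readers who want to stay with the standard library, the 403 case can sometimes be worked around by sending an explicit User-Agent header. The following is a minimal sketch under that assumption, not the method used in this post: the `USER_AGENT` value and the helper names `build_request` and `download` are hypothetical, and whether a server accepts the request depends entirely on that server.

```python
from urllib.request import Request, urlopen

# Hypothetical browser-like identifier (an assumption, not taken from the post).
USER_AGENT = "Mozilla/5.0"


def build_request(url):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def download(url, dest_path, timeout=30):
    """Fetch url with the custom header and write the response body to dest_path."""
    with urlopen(build_request(url), timeout=timeout) as resp, \
            open(dest_path, "wb") as out:
        out.write(resp.read())
```

Calling `download(url, "image.png")` would then play the role of `urlretrieve(url, "image.png")`. The script in this post sidesteps the issue by using requests.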

import os
import random
import time
from os import listdir
from os.path import isfile, join

import requests
from bs4 import BeautifulSoup as bs

out_folder = "D:/BlogBackup"
mypath = "D:/BlogBackup/medium-export/posts"
img_folder = join(out_folder, "MeduimImg")
os.makedirs(img_folder, exist_ok=True)

# Collect the exported post files (sub-directories are skipped).
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for i in onlyfiles:
    # Read one exported post and parse its HTML.
    with open(join(mypath, i), "r", encoding="utf-8") as f:
        soup = bs(f.read(), "html.parser")
    for image in soup.find_all("img"):
        url = image.get("src")
        if not url:
            continue  # skip <img> tags that have no src attribute
        filename = url.split("/")[-1]
        print(filename)
        r = requests.get(url)
        # "*" is not a valid character in Windows file names, so swap it for "$".
        with open(join(img_folder, filename.replace("*", "$")), "wb") as outfile:
            outfile.write(r.content)
        # Pause for a random interval of up to 10 seconds between downloads.
        time.sleep(random.random() * 10)
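
The script takes the file name from the last segment of the URL and only replaces "*". A slightly more defensive version might also drop any query string and replace the other characters that Windows rejects. The sketch below is one possible way to do that with the standard library; `safe_filename`, its `replacement` parameter, the "unnamed" fallback, and the example URL are all hypothetical and not part of the original script.

```python
import re
from urllib.parse import unquote, urlparse

# Characters that Windows does not allow in file names.
_ILLEGAL = re.compile(r'[<>:"/\\|?*]')


def safe_filename(url, replacement="$"):
    """Derive a local file name from the last path segment of url."""
    path = urlparse(url).path  # the path excludes any ?query and #fragment
    name = unquote(path.rsplit("/", 1)[-1])  # decode %xx escapes
    name = _ILLEGAL.sub(replacement, name)
    return name or "unnamed"  # fallback when the URL ends with "/"


# Made-up URL in the style of a CDN image link.
print(safe_filename("https://example.com/max/800/1*abc.png?q=20"))  # 1$abc.png
```

In the loop above, `safe_filename(url)` could stand in for `url.split("/")[-1]` followed by `.replace("*", "$")`.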

References:
https://stackoverflow.com/questions/257409/download-image-file-from-the-html-page-source-using-python
https://stackoverflow.com/questions/34957748/http-error-403-forbidden-with-urlretrieve
https://blog.csdn.net/fengzhizi76506/article/details/59229846

tags: Python

© COPYRIGHT 2011-2022. Max的文藝復興. ALL RIGHT RESERVED.