[Share] Using a Calibre recipe to crawl an offline blog and build a polished Kindle ebook - Security Tools - Kanxue Security Community | kanxue.com


Posted: 2021-10-2 11:31 (9369 views)


Explorerl


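The offline blog, the code, and the generated ebook used in this post are all available at https://github.com/evmn/Nicholas-C.-Zakas/

I have been very interested in Calibre lately, and I am also fond of Nicholas C. Zakas's blog, which is how this project came about.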

Setting up a local server

Clone the project to your machine, then start an HTTP server from the humanwhocodes.com directory:

`python3 -m http.server 8000`
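Before running the recipe, it can help to confirm that the local server is actually answering. The following is only a minimal sketch and not part of the original project: the helper name `server_is_up` is hypothetical, and the default URL assumes the port 8000 used above.

```python
import urllib.error
import urllib.request


def server_is_up(url="http://127.0.0.1:8000/", timeout=3):
    """Return True if an HTTP GET to `url` answers with status 200."""
    # An empty ProxyHandler makes urllib ignore any configured proxy,
    # so the request goes straight to the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

Calling `server_is_up()` from a second terminal while the server is running should return True; if it returns False, check that the server was started from the humanwhocodes.com directory and that the port matches.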

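Debugging the recipe script

Next, debug the script with the command below. With the --test option, the program downloads only a few posts so you can preview the result. Keep tweaking until you are satisfied.

`ebook-convert Human_Who_Codes.recipe .mobi --test -vv --debug-pipeline debug`

This is the script I ended up using to generate the ebook.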

# encoding: utf-8
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
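try:
    from urllib.request import urlopen  # Python 3 (Calibre 5 and later)
except ImportError:
    from urllib2 import urlopen  # Python 2 (older Calibre releases)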
from datetime import datetime
base_url = 'http://127.0.0.1:8000/'
 
class Human_Who_Codes(BasicNewsRecipe):
 
        title = 'Human Who Codes'
        description = u'Hi, I’m Nicholas C. Zakas, an independent software developer living in Mountain View, California. I’ve been a software architect at companies like Yahoo and Box, as well as an author and speaker. I created the ESLint open source project and wrote several books. At the moment, I’m recovering from Lyme disease and haven’t been able to leave my home much in the past six years.'
        cover_url = 'http://127.0.0.1:8000/images/cover.jpg'
#        masthead_url = ''
        remove_tags_before = dict(id='article')
        remove_tags_after= dict(id='article')
        #remove_tags = ['footer']
        remove_tags = [dict(attrs={'class':['grid-columns', 'col-organic', 'nav', 'highlight-background', 'tags', 'post-meta']}),
                dict(id=['sidebar', 'thread__wrapper']),
                dict(attrs={'itemprop':['description']}),
                dict(name=['script', 'noscript', 'style', 'footer', 'hr'])]
        __author__ = 'Nicholas C. Zakas'
        language = 'en'
        encoding = 'utf-8'
        timefmt = ''
#        extra_css = 'h1 {font: sans-serif;}\n.byline {font:monospace;}'
 
        #keep_only_tags = [{ 'class': 'example' }]
        no_stylesheets = True
        resolve_internal_links = True
        remove_javascript = True
        auto_cleanup = False
        delay = 1
        simultaneous_downloads = 5
        oldest_article = 999
        max_articles_per_feed = 999
 
        def parse_index(self):
                soup = self.index_to_soup(base_url)
                archives = soup.find('div', id='sidebar').findAll('ul')[3]
                feeds = []
                desc = ''
                for section in archives.findAll('a'):
                        articles = []
                        secname = section.getText()
                        sec_url = base_url + section['href']
 
                        sec = urlopen(sec_url)
                        blogs = BeautifulSoup(sec.read(), 'html.parser').find('main', id='content').findAll('li')
 
                        for blog in blogs:
                                date = datetime.strptime(blog.find('small').getText(), '%b %d, %Y').strftime("%m-%d: ")
                                title = date + blog.find('a').getText()
                                link =  base_url + "blog/" + secname + "/" + blog.find('a')['href']
                                print("<li><a href=" + link +">" + title  + "</a><br></li>")
 
                                articles.append({'title': title, 'url': link})
 
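                        # Reverse the order listed on the archive page; list() makes
                        # the result a real list instead of a one-shot iterator.
                        feeds.append((secname, list(reversed(articles))))
#                        feeds.append((secname, articles))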
                return feeds
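
In a Calibre recipe, parse_index returns a list of (section title, article list) pairs, and each article is a dict with at least a title and a url. The sketch below reproduces, outside Calibre, how the inner loop of the script builds one entry. The sample date, post title, and path are made-up values for illustration, and `make_article` is a hypothetical helper that does not appear in the recipe.

```python
from datetime import datetime

BASE_URL = "http://127.0.0.1:8000/"


def make_article(date_text, post_title, href, section):
    """Build one article dict the way the recipe's inner loop does."""
    # 'Oct 02, 2021' -> '10-02: ' (parsing %b assumes an English locale)
    prefix = datetime.strptime(date_text, "%b %d, %Y").strftime("%m-%d: ")
    return {
        "title": prefix + post_title,
        "url": BASE_URL + "blog/" + section + "/" + href,
    }


# Made-up sample values for one archive section named "2021".
articles = [make_article("Oct 02, 2021", "Sample post", "sample-post/", "2021")]
feeds = [("2021", list(reversed(articles)))]
print(feeds)
```

Prefixing each title with the month and day keeps the posts recognizable by date in the ebook's table of contents, since each section already carries the name of its archive page.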

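Downloading with Calibre

Open Calibre, click the small triangle to the right of Fetch news, choose Add or edit a custom news source, switch to advanced mode, paste in the script above, and save.

Then make sure no system proxy is enabled (for example privoxy on port 8118), or that 127.0.0.1 is excluded from the proxy. Go to Fetch news, select the news source you just saved, and click download.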
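
One rough way to check the proxy situation is to ask Python's own urllib what it would do for 127.0.0.1. This is only a sketch under assumptions: it reports what the standard library sees (on Linux, the http_proxy and no_proxy environment variables), which may differ from how Calibre detects proxies, and `local_requests_use_proxy` is a hypothetical name.

```python
import urllib.request


def local_requests_use_proxy(host="127.0.0.1"):
    """Return True if urllib would send HTTP requests for `host` via a proxy."""
    proxies = urllib.request.getproxies()
    if "http" not in proxies:
        return False  # no HTTP proxy configured at all
    # proxy_bypass() honours no_proxy-style exclusions for the host.
    return not bool(urllib.request.proxy_bypass(host))


print("127.0.0.1 goes through a proxy:", local_requests_use_proxy())
```

If this prints True, either unset the proxy variables for the session or add 127.0.0.1 to no_proxy before fetching.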

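Downloading from the command line

You can adjust the options to suit your needs. Running ebook-convert a.recipe .mobi --help lists every option available when generating a mobi from a recipe. These are the options I chose when building the ebook for a Kindle Paperwhite 3: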

ebook-convert "Human_Who_Codes.recipe" .mobi \
        --authors="Nicholas C. Zakas" \
        --title="Human Who Codes" \
        --pubdate="2021-11-09" \
        --output-profile=kindle_pw3 \
        --mobi-file-type=new \
        -vv
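
If you rebuild the book often, the same command can be wrapped in a small Python script. This is a sketch rather than anything from the original project: `build_convert_command` and `convert` are hypothetical helpers, and they only assemble and run the ebook-convert command line shown above.

```python
import shutil
import subprocess


def build_convert_command(recipe, output_ext=".mobi", **options):
    """Return the ebook-convert argument list for a recipe conversion."""
    command = ["ebook-convert", recipe, output_ext]
    for name, value in options.items():
        # output_profile="kindle_pw3" becomes --output-profile=kindle_pw3
        command.append("--%s=%s" % (name.replace("_", "-"), value))
    command.append("-vv")
    return command


def convert(recipe, **options):
    """Run ebook-convert; raise if the tool is missing or exits with an error."""
    if shutil.which("ebook-convert") is None:
        raise FileNotFoundError("ebook-convert not found on PATH")
    subprocess.run(build_convert_command(recipe, **options), check=True)
```

For example, `convert("Human_Who_Codes.recipe", authors="Nicholas C. Zakas", title="Human Who Codes", pubdate="2021-11-09", output_profile="kindle_pw3", mobi_file_type="new")` runs the same conversion as the shell command. Passing the arguments as a list avoids shell quoting problems with values that contain spaces.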