[Share] Scraping an offline blog with a Calibre recipe to build a polished Kindle ebook
Posted: 2021-10-2 11:31
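The offline blog, the code, and the generated ebooks used in this post are all available at https://github.com/evmn/Nicholas-C.-Zakas/
I have been very interested in Calibre lately, and I also like Nicholas C. Zakas's blog, which is how this project came about.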
Setting up a local server
Clone the project to your machine, then start an HTTP server inside the humanwhocodes.com directory:
```shell
python3 -m http.server 8000
```
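If you would rather start the server from a script than from the shell, roughly the same thing can be sketched with Python's standard `http.server` module. This is only an illustration: `make_server` is a helper name made up here (it is not part of the project), and it binds to 127.0.0.1 only, which is an assumption that differs from the command above (that one listens on all interfaces by default).

```python
import functools
import http.server


def make_server(directory, port=8000, host="127.0.0.1"):
    """Return an HTTP server that serves static files from `directory`.

    Hypothetical helper for illustration; not part of the original project.
    """
    # SimpleHTTPRequestHandler accepts a `directory` argument since Python 3.7.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((host, port), handler)
```

Calling `make_server("humanwhocodes.com").serve_forever()` from the project root would then serve the blog on port 8000 until you press Ctrl+C.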
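Debugging the recipe script
Next, debug the script with the command below. With the --test option the program downloads only a few posts, so you can quickly check the result. Keep tweaking until you are happy with it.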
```shell
ebook-convert Human_Who_Codes.recipe .mobi --test -vv --debug-pipeline debug
```
This is the script I ended up using to generate the ebook.
```python
# encoding: utf-8
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from datetime import datetime

try:
    from urllib.request import urlopen  # Python 3 (Calibre 5 and later)
except ImportError:
    from urllib2 import urlopen  # Python 2 (older Calibre releases)

# The blog is served locally by "python3 -m http.server 8000".
base_url = 'http://127.0.0.1:8000/'


class Human_Who_Codes(BasicNewsRecipe):
    title = 'Human Who Codes'
    description = u'Hi, I’m Nicholas C. Zakas, an independent software developer living in Mountain View, California. I’ve been a software architect at companies like Yahoo and Box, as well as an author and speaker. I created the ESLint open source project and wrote several books. At the moment, I’m recovering from Lyme disease and haven’t been able to leave my home much in the past six years.'
    cover_url = 'http://127.0.0.1:8000/images/cover.jpg'
    # masthead_url = ''

    # Keep only the element with id="article", then strip leftovers inside it.
    remove_tags_before = dict(id='article')
    remove_tags_after = dict(id='article')
    #remove_tags = ['footer']
    remove_tags = [
        dict(attrs={'class': ['grid-columns', 'col-organic', 'nav', 'highlight-background', 'tags', 'post-meta']}),
        dict(id=['sidebar', 'thread__wrapper']),
        dict(attrs={'itemprop': ['description']}),
        dict(name=['script', 'noscript', 'style', 'footer', 'hr'])]

    __author__ = 'Nicholas C. Zakas'
    language = 'en'
    encoding = 'utf-8'
    timefmt = ''
    # extra_css = 'h1 {font: sans-serif;}\n.byline {font:monospace;}'
    #keep_only_tags = [{ 'class': 'example' }]
    no_stylesheets = True
    resolve_internal_links = True
    remove_javascript = True
    auto_cleanup = False
    delay = 1
    simultaneous_downloads = 5  # Calibre reduces this to 1 whenever delay > 0
    oldest_article = 999
    max_articles_per_feed = 999

    def parse_index(self):
        # The fourth <ul> in the sidebar holds the archive links, one per section.
        soup = self.index_to_soup(base_url)
        archives = soup.find('div', id='sidebar').findAll('ul')[3]
        feeds = []
        desc = ''
        for section in archives.findAll('a'):
            articles = []
            secname = section.getText()
            sec_url = base_url + section['href']
            sec = urlopen(sec_url)
            # Each <li> on the archive page is one post: a date in <small> plus a link.
            blogs = BeautifulSoup(sec.read(), 'html.parser').find('main', id='content').findAll('li')
            for blog in blogs:
                date = datetime.strptime(blog.find('small').getText(), '%b %d, %Y').strftime("%m-%d: ")
                title = date + blog.find('a').getText()
                link = base_url + "blog/" + secname + "/" + blog.find('a')['href']
                # Debug output, visible when running ebook-convert with -vv.
                print("<li><a href=" + link + ">" + title + "</a><br></li>")
                articles.append({'title': title, 'url': link})
            # Reverse the page order of the posts within each section.
            feeds.append((secname, reversed(articles)))
            # feeds.append((secname, articles))
        return feeds
```
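To make the title and link construction in `parse_index` easier to follow, here is a small standalone sketch of just that step, runnable without Calibre. The helper name `build_article` and the sample values (date text, post title, href, and a section name that looks like a year) are assumptions made for illustration, not data taken from the blog.

```python
from datetime import datetime


def build_article(date_text, post_title, href, section_name,
                  base_url="http://127.0.0.1:8000/"):
    """Mirror the title and URL construction used in parse_index.

    Hypothetical helper for illustration only.
    """
    # "%b %d, %Y" matches dates such as "Oct 2, 2021"; %b relies on
    # English month abbreviations (the default C locale).
    prefix = datetime.strptime(date_text, "%b %d, %Y").strftime("%m-%d: ")
    return {
        "title": prefix + post_title,
        "url": base_url + "blog/" + section_name + "/" + href,
    }


# Sample values are made up for illustration.
article = build_article("Oct 2, 2021", "Example post", "example-post/", "2021")
print(article["title"])  # 10-02: Example post
print(article["url"])    # http://127.0.0.1:8000/blog/2021/example-post/
```

Prefixing each title with the month and day keeps the posts recognizable by date in the Kindle table of contents, since the section name already carries the rest.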
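Downloading with Calibre
Open Calibre, click the small triangle to the right of Fetch news, choose Add or edit a custom news source, switch to advanced mode, paste in the script above, and save.
Then make sure that no system proxy is enabled (privoxy on port 8118, for example), or that 127.0.0.1 is excluded from the proxy. Go back to Fetch news, select the news source you just saved, and click download.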
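If the download fails and you suspect a proxy, a quick check along these lines may help. It is only a sketch built on Python's standard `urllib`: it reports what `urllib` sees, which is an assumption about your environment and may not match how Calibre itself detects proxies. Both function names are made up for this example.

```python
import urllib.request


def proxy_applies_to(host="127.0.0.1"):
    """Report whether urllib would send plain HTTP requests to `host` via a proxy."""
    proxies = urllib.request.getproxies()
    return "http" in proxies and not urllib.request.proxy_bypass(host)


def fetch_without_proxy(url, timeout=5):
    """Fetch `url` while ignoring any configured proxy."""
    # An empty ProxyHandler mapping disables proxy use for this opener.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=timeout) as response:
        return response.status, response.read()
```

With the local server running, `fetch_without_proxy("http://127.0.0.1:8000/")` should return status 200; if `proxy_applies_to()` is also `True`, the proxy is the likely culprit.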
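Downloading from the command line
You can adjust the options to suit your needs. Running ebook-convert a.recipe .mobi --help lists every option available when generating a mobi from a recipe. These are the options I chose when building the ebook for a Kindle Paperwhite 3: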
```shell
ebook-convert "Human_Who_Codes.recipe" .mobi \
    --authors="Nicholas C. Zakas" \
    --title="Human Who Codes" \
    --pubdate="2021-11-09" \
    --output-profile=kindle_pw3 \
    --mobi-file-type=new \
    -vv
```
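If you rebuild the book often with different options, the command can also be assembled from a script. The following is a minimal sketch, not part of the original project: `build_command` is a hypothetical helper that turns keyword arguments into `--dashed-options` and does not check them against what `ebook-convert` actually accepts.

```python
import shlex


def build_command(recipe, output=".mobi", verbosity=2, **options):
    """Assemble an ebook-convert argument list from keyword options.

    Hypothetical helper: option names are passed through without validation.
    """
    command = ["ebook-convert", recipe, output]
    for name, value in options.items():
        # output_profile="kindle_pw3" becomes --output-profile=kindle_pw3
        command.append("--%s=%s" % (name.replace("_", "-"), value))
    if verbosity:
        command.append("-" + "v" * verbosity)
    return command


command = build_command(
    "Human_Who_Codes.recipe",
    authors="Nicholas C. Zakas",
    title="Human Who Codes",
    pubdate="2021-11-09",
    output_profile="kindle_pw3",
    mobi_file_type="new",
)
# Print the command in a form that can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))

# To actually run the conversion (requires Calibre's ebook-convert on PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with values that contain spaces, such as the author name.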
Screenshots of the ebook on a Kindle
Last edited by Explorerl on 2021-11-19 08:18