@prvn-pa
Last active June 28, 2021 18:04
Download a webpage and convert it to EPUB

WEB PAGE TO EPUB CONVERSION

1. USING WGET AND PANDOC

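  1. Create a text file listing the URLs to download, one per line. Name it something like list.txt
  2. Download each listed page, along with the images and stylesheets it needs, using the command wget -E -H -k -p -i list.txt (-E adds the .html extension where needed, -H allows files from other hosts, -k rewrites links to work offline, -p fetches page requisites, -i reads the URLs from list.txt).
  3. All the pages should now be available offline.
  4. Go to the folder containing the HTML files and run pandoc filename-*.html -o filename.epub (the shell expands filename-*.html in alphabetical order, so filename-10.html comes before filename-2.html unless the numbers are zero-padded).
  5. Alternatively, create an 'index.html' file and include links to all the pages in its body, like <a href="https://venmurasu.in/mazhaippadal/chapter-5">1</a>
  6. Then use calibre to convert index.html to EPUB with ebook-convert index.html book.epub --title Title --authors Author --cover cover.png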
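
For step 4, the chapter files may need to reach pandoc in numeric rather than alphabetical order. A minimal sketch of one way to do that, assuming the files follow the filename-*.html pattern used above; natural_key and pandoc_command are hypothetical helper names, not part of pandoc:

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that compares embedded numbers as integers, so 10 sorts after 2."""
    parts = re.split(r"([0-9]+)", name)
    # re.split with a capturing group alternates text (even index) and digits (odd index).
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def pandoc_command(folder, pattern="filename-*.html", output="filename.epub"):
    """Build the pandoc argument list with the matching files in numeric order."""
    files = sorted((p.name for p in Path(folder).glob(pattern)), key=natural_key)
    return ["pandoc", *files, "-o", output]


# Usage (assumes pandoc is installed and the HTML files are in the current folder):
#   import subprocess
#   subprocess.run(pandoc_command("."), check=True)
```

The function only builds the command; running it is left to the commented usage lines so the ordering can be checked first.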

Sample index.html file

<!doctype html>
<html>
  <head>
    <title>This is the title of the webpage!</title>
  </head>
  <body>
	<a href="chapter-1.html">1</a>
	<a href="chapter-2.html">2</a>
	<a href="chapter-3.html">3</a>
	<a href="chapter-4.html">4</a>
  </body>
</html>
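
For a book with many chapters, the index.html can be generated instead of typed by hand. A rough sketch, assuming the chapter files are already downloaded and named as in the sample; build_index and write_index are hypothetical names chosen for illustration:

```python
from html import escape
from pathlib import Path


def build_index(chapter_files, title="Title"):
    """Return the text of an index.html that links to each chapter file in order."""
    links = "\n".join(
        f'    <a href="{escape(name, quote=True)}">{number}</a>'
        for number, name in enumerate(chapter_files, start=1)
    )
    return (
        "<!doctype html>\n"
        "<html>\n"
        "  <head>\n"
        f"    <title>{escape(title)}</title>\n"
        "  </head>\n"
        "  <body>\n"
        f"{links}\n"
        "  </body>\n"
        "</html>\n"
    )


def write_index(chapter_files, path="index.html", title="Title"):
    """Write the generated page to disk (overwrites an existing file at path)."""
    Path(path).write_text(build_index(chapter_files, title), encoding="utf-8")


# Usage: write_index([f"chapter-{n}.html" for n in range(1, 5)])
```

The usage line would produce a file with the same four links as the sample above.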

Note on the link list file

  • If the links end with a number in ascending order, copy the first link into LibreOffice Calc and drag the fill handle down to the last number to generate all the links.
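
The same list can also be produced with a few lines of code. A small sketch, assuming every link differs only in its trailing number; numbered_links is a hypothetical helper, and the last chapter number in the usage line is made up for illustration:

```python
import re


def numbered_links(first_link, last_number):
    """Expand a link ending in a number into all links up to last_number."""
    match = re.search(r"([0-9]+)$", first_link)
    if match is None:
        raise ValueError("the first link must end with a number")
    prefix = first_link[:match.start()]
    first_number = int(match.group(1))
    # Note: zero-padding (e.g. chapter-01) is not preserved in this sketch.
    return [f"{prefix}{n}" for n in range(first_number, last_number + 1)]


# Usage (the last number, 50, is an assumption):
#   links = numbered_links("https://venmurasu.in/mazhaippadal/chapter-1", 50)
#   with open("list.txt", "w", encoding="utf-8") as f:
#       f.write("\n".join(links) + "\n")
```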

To Delete Unwanted Strings in the Scraped Text

Enter the strings to remove in a file (e.g. delete.txt), one string per line, as below. Do not leave blank lines in the file: an empty pattern matches every line, so grep -v would then remove everything.

:::
venmurasu.jpg
.items-center
.p-1
.text
.feather
.ml-3
data:image
# [[]{.icon
.items
.border
.transition
Home
.icon
.mb-4

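Now use grep to drop every line that contains any of these strings: grep -vf delete.txt filename.md > tmpfile && mv tmpfile newname.md. grep treats each string as a regular expression, so a '.' matches any character; add -F (grep -vFf delete.txt ...) if the strings should be matched literally.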
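
Where grep is not available, the filtering step could be approximated as below. This is only a sketch: it matches the strings literally (plain substring test) rather than as regular expressions, and load_strings and drop_matching_lines are hypothetical names:

```python
def load_strings(text):
    """Read one string per line from the delete-list text, skipping blank lines."""
    return [line for line in text.splitlines() if line.strip()]


def drop_matching_lines(text, unwanted):
    """Keep only the lines that contain none of the unwanted strings."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not any(s in line for s in unwanted)
    ]
    return "".join(kept)


# Usage (file names follow the grep example):
#   from pathlib import Path
#   unwanted = load_strings(Path("delete.txt").read_text(encoding="utf-8"))
#   source = Path("filename.md").read_text(encoding="utf-8")
#   Path("newname.md").write_text(drop_matching_lines(source, unwanted), encoding="utf-8")
```

Skipping blank lines in the delete list avoids the case where an empty string would match, and therefore remove, every line.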

2. USING A CHROME EXTENSION

  • Install the EpubPress extension in the browser (tested on Chrome).
  • Open the webpages to be converted into EPUB in separate tabs.
  • Click the extension icon, select the pages to include, and start the conversion.

Limitations of EpubPress

  • Books are limited to containing 50 articles.
  • Books must be 10 Mb or less for email delivery to work.
  • Images in an article must be 1 Mb or less. Images that exceed this limit will be removed.
  • No more than 30 images will be downloaded.