NiftyArticles – extract articles from any webpage with Python, Webkit and Readability

Posted September 2nd, 2010 in Python by Florentin

What is NiftyArticles

NiftyArticles is a small Python script which lets you identify and save articles from inside webpages. It is based on Readability.

Update 10.02.2011

I have made two versions of the script: one uses QT4 and the other is based on GTK.
The QT4 version is slightly better.

Installation

Each version depends on readability.js. Please download that file and place it in the same directory as the niftyarticles script.

The installation details for each version are located inside the script file.
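
Before running either version, it can help to confirm that readability.js really sits next to the script. The snippet below is only a minimal sketch of such a check: the helper name `has_readability` is hypothetical and not part of NiftyArticles, and the only detail taken from this post is the file name readability.js.

```python
import os


def has_readability(script_path):
    """Return True if readability.js is in the same directory as the script.

    Hypothetical helper for illustration; it is not part of NiftyArticles.
    """
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.isfile(os.path.join(script_dir, "readability.js"))


if __name__ == "__main__":
    # Script name as used in the usage section; adjust the path as needed.
    print(has_readability("niftyarticles-qt.py"))
```

If this prints False, download readability.js into the script's directory first.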

Usage (QT4 version)

To get a full list of options, run:
python niftyarticles-qt.py --help
OR if you made it executable,
/path/to/niftyarticles-qt.py --help

# save an article from ubuntu.com
python niftyarticles-qt.py -u http://www.ubuntu.com/project/derivatives

Help for the QT4 version

The script has been developed and tested on Python 2.6.6 / Ubuntu Maverick

Install the required packages (python-qt4 and libqt4-webkit).
On Ubuntu or Debian:
sudo apt-get install python2.6 python-qt4 libqt4-webkit

(Optional) Mark the script as executable
chmod +x niftyarticles-qt.py

Usage Examples:
python niftyarticles-qt.py --help

# save an article from ubuntu.com
python niftyarticles-qt.py -u http://www.ubuntu.com/project/derivatives

# show the browser window, load images, enable Java applets
python niftyarticles-qt.py -b -i -v -u http://www.ubuntu.com/project/canonical-and-ubuntu

# to run the script on a server with no GUI available, you must install xvfb
sudo apt-get install xvfb
# then you can do
xvfb-run -a python niftyarticles-qt.py -u http://www.ubuntu.com/project/derivatives
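
To save several pages in one go, the commands above can be wrapped in a small driver. This is a rough sketch, not part of NiftyArticles: `build_command` and `save_all` are hypothetical helpers, they use only the -u option and the xvfb-run -a prefix shown above, and they assume the script exits with a non-zero status when something goes wrong.

```python
import subprocess


def build_command(url, script="niftyarticles-qt.py", headless=False):
    """Build the argument list for a single run of the script.

    Only the -u option and the xvfb-run -a prefix from the examples
    above are used; no other options are assumed.
    """
    command = ["python", script, "-u", url]
    if headless:
        # Same prefix as the server example: xvfb-run -a python ...
        command = ["xvfb-run", "-a"] + command
    return command


def save_all(urls, headless=False):
    """Run the script once per URL and return the URLs that failed.

    Assumption: a non-zero exit status means the run failed.
    """
    failed = []
    for url in urls:
        if subprocess.call(build_command(url, headless=headless)) != 0:
            failed.append(url)
    return failed
```

For example, `save_all(["http://www.ubuntu.com/project/derivatives"], headless=True)` runs the same command as the xvfb example above and returns a list of the URLs that could not be saved.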

Webkit Reference

http://webkitgtk.org/reference/index.html

Download NiftyArticles from GitHub

3 Responses so far.

  1. wHOevERest says:

    Is there a Windows version of the module? Can I somehow import these packages in Windows or am I screwed? :)

  2. Thank you for these helpful scripts.

    I have a problem with characters outside ASCII. My LANG=cs_CZ.UTF-8 and I’m getting incorrect characters in some parts, even on English pages.

    » minimalist FAQs :mnmlist

    This can be fixed by adding this line at the beginning of the document:

    so the result is as expected:

    » minimalist FAQs :mnmlist

    Can this be added in some clean way into your scripts?

    Thank you,
    Jiri

  3. Florentin says:

    Hi Jiri,

    Which version of the script causes the Unicode errors?
    Could you please provide a usage example where the result is faulty?

    Thank you.
