[Cialug] OT - manipulating blogger exported XML
Nicolai
nicolai-cialug at chocolatine.org
Sun Oct 2 19:19:45 CDT 2011
On Sun, Oct 02, 2011 at 06:11:38PM -0500, Nathan C. Smith wrote:
> Or instead use wget and pull down the whole site into html and somehow
> stitch that together into a single document from the resultant pages.
After looking at Dave's posted blog.xml file, and the contents of a
typical http://example.blogspot.com/date/file.html, I gotta second the
wget approach, maybe with a little lynx -dump magic. (Does wget have
such functionality built-in?) Those html files are heinous.
Running lynx -dump on the files wget fetches is ugly, but not as bad as
parsing blog.xml, and in the dumped text both (1) the title and (2) the
text body are easy to identify. A few [EMBED]-type strings may show up
in the text, but those are easy to fix.
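
A rough sketch of that pipeline, in Python wrapping the two commands.
Everything here is an assumption for illustration, not something tested
against Dave's blog: the helper names (mirror_cmd, dump_cmd, clean_dump,
stitch, main) are made up, the wget output directory is assumed to be
named after the host, and [EMBED] strings are simply deleted, which may
or may not be the fix you want.

```python
import re
import subprocess
import sys
from pathlib import Path
from urllib.parse import urlparse

# All helper names below are hypothetical; none come from the original thread.

def mirror_cmd(url):
    """Build a wget command that recursively fetches HTML pages below url."""
    return ["wget", "--recursive", "--no-parent", "--accept", "html", url]

def dump_cmd(path):
    """Build a lynx command that renders one HTML file as plain text."""
    # -nolist suppresses the numbered link list lynx appends to a dump.
    return ["lynx", "-dump", "-nolist", str(path)]

EMBED = re.compile(r"\[EMBED\]")

def clean_dump(text):
    """Delete [EMBED] placeholders and trailing whitespace from a dump."""
    lines = (EMBED.sub("", line).rstrip() for line in text.splitlines())
    return "\n".join(lines).strip()

def stitch(dumps, separator="\n\n" + "-" * 72 + "\n\n"):
    """Join cleaned page dumps into one document, skipping empty pages."""
    cleaned = [clean_dump(d) for d in dumps]
    return separator.join(c for c in cleaned if c) + "\n"

def main(url, outfile):
    """Mirror the site, dump each page with lynx, write one text file."""
    subprocess.run(mirror_cmd(url), check=True)
    # Assumption: wget saved the pages under a directory named after the host.
    root = Path(urlparse(url).netloc)
    dumps = []
    for page in sorted(root.rglob("*.html")):
        result = subprocess.run(dump_cmd(page), capture_output=True,
                                text=True, errors="replace", check=True)
        dumps.append(result.stdout)
    Path(outfile).write_text(stitch(dumps))

if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
```

Saved as, say, stitch.py (a made-up name), it would be run as
"python stitch.py http://example.blogspot.com/ blog.txt". Pages come out
in sorted path order, which is only roughly chronological, so some
reordering by hand may still be needed.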
Nicolai