Non-correcting library for parsing/modifying broken HTML/PHP files?

Hi,

does anyone know of a library which can work with broken/malformed
HTML/PHP and still produce output identical to the input?

So far I’ve tried Nokogiri and Hpricot. They’re absolutely amazing and
excel at their purpose, but they fail to meet my requirement: when saving
the HTML, nothing that I haven’t changed through DOM manipulation should
change in the output.

The thing is that I have to work with such horribly broken HTML (or rather,
PHP) documents that those libraries are

On Tue, Apr 5, 2011 at 10:56 AM, Markus F. wrote:

> troublesome for me, as I have to fix a few hundred, maybe up to thousands of
> documents, and their versioned history should really only reflect the change
> I’m making and not what the library needs to change so it can work with it. I
> looked on RubyGems but was unable to come up with more libraries. Did I
> miss them?

What about one initial rework to get proper (X)HTML, submit it to your
version control and then create those modifications that you need to
do? That approach has served me quite well for example when enforcing
a particular source code formatting.

Cheers

robert

Hi Robert,

On 05.04.2011 14:59, Robert K. wrote:

> What about one initial rework to get proper (X)HTML, submit it to your
> version control and then create those modifications that you need to
> do? That approach has served me quite well for example when enforcing
> a particular source code formatting.

I considered this approach too. Unfortunately, it turns out that it
disturbs the history too much, i.e. blaming of content. I mean, nothing
gets “broken”, but when you blame/annotate (and we do this), you get
irrelevant noise in it, which I really try to avoid.

thanks,

- Markus
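
One way to approach the requirement discussed in this thread, assuming the changes are simple enough to express as text patterns, is to skip DOM parsing entirely and edit the source as a string, so that every byte outside the matched span is written back unchanged. The following is only a minimal sketch: the helper names `replace_attr_value` and `changed_lines` and the sample document are made up for illustration, and a regular expression like this is not a general HTML parser.

```ruby
# Hypothetical helper: rewrite one attribute value at the text level,
# leaving every other byte of the source untouched.
def replace_attr_value(source, attr, old_value, new_value)
  # Match attr="old" or attr='old'. The prefix and the quote character are
  # captured so that they can be written back exactly as they were.
  pattern = /(\b#{Regexp.escape(attr)}\s*=\s*)(["'])#{Regexp.escape(old_value)}\2/
  source.gsub(pattern) { "#{$1}#{$2}#{new_value}#{$2}" }
end

# Return the 1-based numbers of the lines that differ between two versions.
# Assumes the edit neither adds nor removes lines.
def changed_lines(before, after)
  b = before.lines
  a = after.lines
  raise ArgumentError, "line count changed" unless a.size == b.size
  b.each_index.select { |i| b[i] != a[i] }.map { |i| i + 1 }
end

# Deliberately malformed sample input (illustrative only).
broken = <<~HTML
  <?php include "header.php" ?>
  <TABLE><tr><td>unclosed cell
  <a href="old.html">link<p>
  <?php echo $footer ?>
HTML

fixed = replace_attr_value(broken, "href", "old.html", "new.html")
puts changed_lines(broken, fixed).inspect  # => [3]
```

Because only line 3 differs, a blame/annotate on the committed result would attribute just that line to the change. The trade-off is that anything needing real tree structure (nesting, sibling relationships) is out of reach for this kind of text-level edit.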