@jsanti
Created November 23, 2010 17:52
Revisions

  1. jsanti created this gist Nov 23, 2010.
    26 changes: 26 additions & 0 deletions repair_utf8_iso88591_mix.perl
    @@ -0,0 +1,26 @@
    #!/usr/bin/perl
    #
    # Repairing broken documents that mix UTF-8 and ISO-8859-1
    # http://plasmasturm.org/log/416/
    #

    use strict;
    use warnings;

    use Encode qw( decode FB_QUIET );

    binmode STDIN, ':bytes';
    binmode STDOUT, ':encoding(UTF-8)';

    my $out;

    while ( <> ) {
        $out = '';
        while ( length ) {
            # consume input string up to the first UTF-8 decode error;
            # with FB_QUIET, decode() removes the bytes it decoded from $_
            $out .= decode( "utf-8", $_, FB_QUIET );
            # consume one character; all octets are valid Latin-1
            # (the substr() lvalue lets decode() remove that octet from $_)
            $out .= decode( "iso-8859-1", substr( $_, 0, 1 ), FB_QUIET ) if length;
        }
        print $out;
    }
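
To try the loop without preparing a broken input file, the same logic can be wrapped in a function and fed a hand-built byte string. This is only a sketch: `repair_mixed` is a hypothetical helper name and the sample bytes are made up for illustration; neither is part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( decode FB_QUIET );

# repair_mixed() is a hypothetical helper, not part of the original script.
# It runs the same decode loop over a byte string and returns characters.
sub repair_mixed {
    my ($bytes) = @_;    # work on a copy: decode() with FB_QUIET consumes its input
    my $out = '';
    while ( length $bytes ) {
        # decode as much valid UTF-8 as possible; the undecoded rest stays in $bytes
        $out .= decode( 'utf-8', $bytes, FB_QUIET );
        # the next octet is not valid UTF-8 here, so take it as one Latin-1 character
        $out .= decode( 'iso-8859-1', substr( $bytes, 0, 1 ), FB_QUIET )
            if length $bytes;
    }
    return $out;
}

# Made-up sample: "cafe" with e-acute, once as UTF-8 (C3 A9), once as Latin-1 (E9)
my $mixed    = "caf\xC3\xA9 caf\xE9";
my $repaired = repair_mixed($mixed);

binmode STDOUT, ':encoding(UTF-8)';
print $repaired, "\n";
```

Both spellings should come out as the same character (U+00E9), so the output is uniformly UTF-8. The copy into `$bytes` matters because `decode` with `FB_QUIET` modifies its source argument in place.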