Perforce Chronicle 2012.2/486814
API Documentation

P4Cms_Filter_HtmlEntityDecode Class Reference

Converts html entities to their character equivalents.

Public Member Functions

 __construct ($options=array())
 Sets filter options (e.g. charset).
 filter ($value)
 Convert html entities in the given string to their special character equivalents.
 getCharset ()
 Get the character set used for entity translation.
 setCharset ($charset)
 Set the character set to use for entity translation.

Public Attributes

const UTF8 = 'UTF-8'

Protected Member Functions

 _numericEntityCallback ($matches)
 Given matches from preg_replace_callback, return either the decoded numeric entity or the original entity if unable to decode.

Protected Attributes

 $_charset = self::UTF8

Detailed Description

Converts html entities to their character equivalents.

Supports hex and decimal html entities, and case-insensitive matching for named entities where case does not matter (e.g. &nbsp; and &NBSP;). Provides character set conversion; defaults to UTF-8.

Copyright:
2011-2012 Perforce Software. All rights reserved
License:
Please see LICENSE.txt in top-level folder of this distribution.
Version:
2012.2/486814

Constructor & Destructor Documentation

P4Cms_Filter_HtmlEntityDecode::__construct ( $options = array() )

Sets filter options (e.g. charset).

Parameters:
string|array|Zend_Config  $options  the character set to decode to, given either as a string or as an array or Zend_Config with a 'charset' key.
Returns:
void
    {
        if ($options instanceof Zend_Config) {
            $options = $options->toArray();
        }

        if (is_array($options) && isset($options['charset'])) {
            $this->_charset = $options['charset'];
        } else if (is_string($options)) {
            $this->_charset = $options;
        }
    }
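
The three accepted forms of $options can be illustrated with a small standalone sketch. The function name resolveCharset() and its $default parameter are hypothetical and not part of the class; the check for a toArray() method is only a stand-in for the Zend_Config test shown above, so that the example runs without Zend Framework.

```php
<?php
/**
 * Hypothetical helper mirroring the constructor's option handling.
 * Returns the charset named by $options, or $default if none is given.
 */
function resolveCharset($options = array(), $default = 'UTF-8')
{
    // stand-in for "instanceof Zend_Config": accept any object with toArray().
    if (is_object($options) && method_exists($options, 'toArray')) {
        $options = $options->toArray();
    }

    // array form: only the 'charset' key is consulted.
    if (is_array($options) && isset($options['charset'])) {
        return $options['charset'];
    }

    // string form: the string itself is the charset.
    if (is_string($options)) {
        return $options;
    }

    // anything else leaves the default in place.
    return $default;
}
```

As in the constructor, an array without a 'charset' key, or a value of any other type, is ignored rather than rejected, so the default stays in effect.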

Member Function Documentation

P4Cms_Filter_HtmlEntityDecode::_numericEntityCallback ( $matches ) [protected]

Given matches from preg_replace_callback, return either the decoded numeric entity or the original entity if unable to decode.

Parameters:
array  $matches  array of matched elements passed from preg_replace_callback.
Returns:
string the replacement string (decoded entity).
    {
        // normalize entities to ints (unicode codepoints).
        if (strtolower($matches[1][0]) === 'x') {
            $value = hexdec(substr($matches[1], 1));
        } else {
            $value = intval($matches[1]);
        }

        // utf-32 (little-endian) encode unicode codepoint (utf-32 is easiest).
        // unicode codepoint must fit in a 32 bit number.
        if ($value > 0xFFFFFFFF) {
            return $matches[0];
        }
        $value = pack('V', $value);

        // return the converted character or the original entity on failure.
        $value = @iconv('UTF-32LE', $this->_charset, $value);
        return strlen($value) ? $value : $matches[0];
    }
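
The pack-then-iconv approach above can be tried outside the class with the following sketch. Both function names are hypothetical, and the pattern used here accepts any number of hex digits, whereas the pattern shown in filter() matches hex digits in pairs. It assumes a 64-bit PHP build with the iconv extension.

```php
<?php
/**
 * Hypothetical helper: convert one Unicode code point to $charset.
 * Returns null when the code point cannot be converted.
 */
function decodeCodepoint($codepoint, $charset = 'UTF-8')
{
    // a UTF-32 code unit is 32 bits wide; larger values cannot be packed.
    if ($codepoint < 0 || $codepoint > 0xFFFFFFFF) {
        return null;
    }

    // 'V' packs an unsigned 32-bit little-endian integer: one UTF-32LE unit.
    $bytes  = pack('V', $codepoint);
    $result = @iconv('UTF-32LE', $charset, $bytes);

    return ($result === false || $result === '') ? null : $result;
}

/**
 * Hypothetical helper: decode decimal (&#65;) and hex (&#x41;) entities,
 * leaving any entity that cannot be converted untouched.
 */
function decodeNumericEntities($html, $charset = 'UTF-8')
{
    return preg_replace_callback(
        '/&#(x[0-9a-f]+|[0-9]+);/i',
        function ($matches) use ($charset) {
            $body = $matches[1];

            // normalize the entity body to an integer code point.
            $codepoint = (strtolower($body[0]) === 'x')
                ? hexdec(substr($body, 1))
                : intval($body);

            $decoded = decodeCodepoint($codepoint, $charset);
            return $decoded === null ? $matches[0] : $decoded;
        },
        $html
    );
}
```

Returning the original match on failure keeps the output lossless: text that merely looks like an entity, or names a character the target charset cannot represent, passes through unchanged.
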
P4Cms_Filter_HtmlEntityDecode::filter ( $value )

Convert html entities in the given string to their special character equivalents.

Note: invalid entities are not decoded.

Parameters:
mixed  $value  the html to be decoded.
Returns:
string the html with valid entities decoded.
    {
        $mapping      = array();
        $entities     = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES);
        $entityCounts = array_count_values(array_map('strtolower', $entities));
        foreach ($entities as $character => $entity) {

            // translated character should be in requested charset
            // using iconv for its better charset support.
            $character = html_entity_decode($entity, ENT_QUOTES, self::UTF8);
            $character = iconv(self::UTF8, $this->_charset, $character);
            
            $mapping[$entity] = $character;

            // some entities vary by case (e.g. &aring, &Aring), if this entity
            // has only one entry, support both upper and lower-case variations.
            if ($entityCounts[strtolower($entity)] == 1) {
                $mapping[strtoupper($entity)] = $character;
                $mapping[strtolower($entity)] = $character;
            }
        }

        // do decoding of named entities.
        $value = str_replace(array_keys($mapping), array_values($mapping), $value);

        // perform decoding of hex and decimal html entities.
        $value = preg_replace_callback(
            "/&#(x([0-9a-f][0-9a-f])+|[0-9]+);/i",
            array($this, '_numericEntityCallback'),
            $value
        );

        return $value;
    }
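
The case-handling rule in the loop above can be seen in isolation with this sketch. The function name buildEntityMap() is hypothetical; the sketch targets UTF-8 only and therefore skips the iconv conversion step, taking each character directly from the translation table.

```php
<?php
/**
 * Hypothetical helper: build an entity => character map (UTF-8) in which
 * entities whose name is unambiguous regardless of case also get
 * all-upper-case and all-lower-case spellings.
 */
function buildEntityMap()
{
    // keys are characters (UTF-8), values are entities such as "&amp;".
    $entities = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');

    // count how many entities share the same lower-case spelling;
    // "&aring;" and "&Aring;" share one, "&nbsp;" has its own.
    $counts = array_count_values(array_map('strtolower', $entities));

    $map = array();
    foreach ($entities as $character => $entity) {
        $map[$entity] = $character;

        // only add case variants when they cannot collide with another entity.
        if ($counts[strtolower($entity)] == 1) {
            $map[strtoupper($entity)] = $character;
            $map[strtolower($entity)] = $character;
        }
    }

    return $map;
}
```

With this map, &NBSP; and &AMP; resolve like their lower-case forms, while &ARING; is left without an entry because it could mean either &aring; or &Aring;. The map can be applied with strtr($html, buildEntityMap()), which replaces each position at most once; that is a substitution made for this sketch, as the method above uses str_replace.
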
P4Cms_Filter_HtmlEntityDecode::getCharset ( )

Get the character set used for entity translation.

Returns:
string the charset in use.
    {
        return $this->_charset;
    }
P4Cms_Filter_HtmlEntityDecode::setCharset ( $charset )

Set the character set to use for entity translation.

Parameters:
string  $charset  the target character set.
Returns:
P4Cms_Filter_HtmlEntityDecode provides fluent interface.
    {
        if (!is_string($charset)) {
            throw new InvalidArgumentException(
                "Cannot set character set. Charset must be a string."
            );
        }

        $this->_charset = $charset;

        return $this;
    }
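
The fluent interface mentioned above comes from returning $this, which lets a setter call be chained directly into another call. The class below is a hypothetical stand-in that copies only the accessor pair, so the pattern can be run without the Chronicle library.

```php
<?php
/**
 * Hypothetical stand-in reproducing only the charset accessors.
 */
class CharsetHolder
{
    protected $_charset = 'UTF-8';

    public function setCharset($charset)
    {
        // reject non-strings up front rather than failing later in iconv.
        if (!is_string($charset)) {
            throw new InvalidArgumentException(
                "Cannot set character set. Charset must be a string."
            );
        }

        $this->_charset = $charset;

        // returning the instance is what makes chaining possible.
        return $this;
    }

    public function getCharset()
    {
        return $this->_charset;
    }
}

// chained usage: set the charset and read it back in one expression.
$holder  = new CharsetHolder();
$charset = $holder->setCharset('ISO-8859-1')->getCharset();
```

Unlike the constructor, which ignores option values it does not recognize, the setter throws on a non-string argument.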

Member Data Documentation

P4Cms_Filter_HtmlEntityDecode::$_charset = self::UTF8 [protected]

The documentation for this class was generated from the following file: