Python lxml etree fails to parse HTML text (returns NoneType)

drkbr07n  posted 4 months ago  in Python

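Why is the output `None`? It should be `<Element html at 0x101dbd240>` or something similar.

Note: the problem only occurs on my Mac. I also tried a fresh virtualenv created with pyenv and Python 3.9.6, and it fails there too.

This is my lxml version: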

lxml   4.9.3

After I change the variable `html` to anything else (even `html = "haha 123"`), it works!

Even stranger:

The HTML comes from `requests.get().text`. In my tests, every other part of the original text parses fine with etree; only the part below fails.

from lxml import etree
html = '''
<div class="nav">
    <div class="item-area nav-free">
        <a id="menu_free" class="item link active" href="/free/">
            <div class="item-wrap">免费代理</div>
        </a>
    </div>
    <div class="item-area nav-product">
        <div id="menu_product_list" class="item ">
            <div class="item-wrap">
                <span>产品</span><span class="dropdown iconfont icon-xiajiantou"></span>
            </div>
        </div>

        <div class="popover bottom" id="menu_product_dropdown" style="left: -222px">
            <div class="popover__arrow" style="left: 250px"></div>
            <div class="popover__content">
                <div class="top-menu">
                    <div class="menu-main">
                        <div class="product__menu-group">
                            <div class="product__menu-icon">
                                <img src="/img/v3/[email protected]" alt="隧道代理图标"/>
                            </div>
                            <a class="product__user-center r-icon" href="/usercenter/tps/">产品管理</a>
                            <a class="product__menu-title tps" href="/tps">
                                <span class="product__type-name">隧道代理</span>
                                <span class="product__desc">高性能云端自动切换代理IP服务</span>
                            </a>
                            <div class="product__menu-submenu media-pc">
                                <div class="submenu-item">
                                    <img src="/img/v3/[email protected]" alt="隧道包年包月">
                                    <a href="/cart?t=tps" class="gray-link">包年包月</a>
                                </div>
                                <div class="submenu-item">
                                    <img src="/img/v3/[email protected]" alt="隧道按量付费">
                                    <a href="/cart?t=tps_c" class="gray-link">按量付费</a>
                                </div>
                            </div>
                        </div>

                        <div class="product__menu-group">
                            <div class="product__menu-icon">
                                <img src="/img/v3/[email protected]" alt="私密代理图标"/>
                            </div>
                            <a class="product__user-center r-icon" href="/usercenter/dps/">产品管理</a>
                            <a class="product__menu-title dps" href="/dps">
                                <span class="product__type-name">私密代理</span>
                                <span class="product__desc">高品质动态短效代理IP服务</span>
                            </a>
                            <div class="product__menu-submenu media-pc">
                                <div class="submenu-item">
                                    <img src="/img/v3/[email protected]" alt="私密集中提取">
                                    <a href="/cart?t=dps_2&c=1" class="gray-link">集中提取</a>
                                </div>
                                <div class="submenu-item">
                                    <img src="/img/v3/[email protected]" alt="私密均匀提取">
                                    <a href="/cart?t=dps" class="gray-link">均匀提取</a>
                                </div>
                                <div class="submenu-item">
                                    <img src="/img/v3/[email protected]" alt="私密按IP付费标准版">
                                    <a href="/cart?t=dps_c&c=1" class="gray-link">按IP付费标准版</a>
                                </div>
                                <div class="submenu-item">
                                    <img src="/img/v3/[email protected]" alt="私密按IP付费专业版">
                                    <a href="/cart?t=dps_c_pro" class="gray-link">按IP付费专业版</a>
                                </div>
                            </div>
                        </div>

                        <div class="product__menu-group">
                            <div class="product__menu-icon">
                                <img src="/img/v3/[email protected]" alt="独享代理图标"/>
                            </div>
                            <a class="product__user-center r-icon" href="/usercenter/kps/">产品管理</a>
                            <a class="product__menu-title kps" href="/kps">
                                <span class="product__type-name">独享代理</span>
                                <span class="product__desc">高品质极速稳定的长效代理IP服务</span>
                            </a>
                            <div class="product__menu-submenu media-pc">
                                <div class="submenu-item">
                                    <img src="/img/v3/[email protected]" alt="独享IP共享静态型">
                                    <a href="/cart?t=kps_sta" class="gray-link">共享静态型</a>
                                </div>
                                <div class="submenu-item">
                                    <img src="/img/v3/[email protected]" alt="独享IP共享动态型">
                                    <a href="/cart?t=kps_dyn" class="gray-link">共享动态型</a>
                                </div>
                                <div class="submenu-item">
                                    <img src="/img/v3/[email protected]" alt="独享IP独享型">
                                    <a href="/cart?t=kps_exip" class="gray-link">独享型</a>
                                </div>
                            </div>
                        </div>
                    </div>
                    
                    <div class="menu-other media-pc">
                        <div>增值服务</div>
                        <a class="btn plugin" href="/extension">专属Chorme插件</a>
                        
                        <a class="btn student" href="/student">学生优惠产品</a>
                        <div class="applet">
                            <img src="/img/v3/[email protected]" width="96" alt="快代理云服务">
                            <p>快代理小程序</p>
                        </div>

                    </div>
                    <div class="menu-other-mobile media-m">
                        <a class="btn plugin" href="/extension"></a>
                        <a class="btn student" href="/student"></a>
                        <a class="btn appletMobile" href="javascript: void(0);"></a>
                    </div>
                </div>
            </div>
        </div>
    </div>
    <div class="item-area nav-pricing">
        <a id="menu_pricing" class="item link " href="/pricing/">
            <div class="item-wrap">定价</div>
        </a>
    </div>
    <div class="item-area nav-doc">
        <div id="menu_doc" class="item ">
            <div class="item-wrap">
                <span>文档与支持</span><span class="dropdown iconfont icon-xiajiantou"></span>
            </div>
        </div>

        <div class="popover bottom" style="left: -200px" id="menu_doc_dropdown">
            <div class="popover__arrow" style="left: 250px"></div>
            <div class="popover__content">
                <div class="top-menu">
                    <div class="menu-main">
                        <!-- 产品功能 -->
                        <div class="menu-group dev">
                            <div class="menu-type">
                                <div class="menu-type--desc">
                                    <div class="subtitle">产品与功能</div>
                                    <div class="underline media-pc"></div>
                                    <div class="flex">
                                        <div class="menu-oneLevel">
                                            <ul>
                                                <li>
                                                    <img src="/img/v3/[email protected]" alt="ip代理文档">
                                                    <a href="/helpcenter/">帮助中心</a>
                                                </li>
                                                <li>
                                                    <img src="/img/v3/[email protected]">
                                                    <div>
                                                        <a href="/doc/product/tps/">产品介绍</a>
                                                        <div class="menu-twoLevel">
                                                            <a href="/doc/product/tps/" class="gray-link">隧道代理</a>
                                                            <a href="/doc/product/dps/" class="gray-link">私密代理</a>
                                                            <a href="/doc/product/kps/" class="gray-link">独享代理</a>
                                                            
                                                        </div>
                                                    </div>
                                                </li>
                                                <li>
                                                    <img src="/img/v3/[email protected]" alt="ip代理功能" style="margin-left:-1px;">
                                                    <a href="/doc/func/identity_overview/">功能介绍</a>
                                                </li>
                                                <li>
                                                    
                                                    <svg style="margin: 4px 4px 3px 3px;" width="13" height="13" viewBox="0 0 15 15">
                                                        <path fill="#3989ff" d="M 8,1 A 7,7 0 0 0 2.918,3.1973 L 5.1523,7.0664 A 3,3 0 0 1 8,5 H 14.31 A 7,7 0 0 0 8,1 Z M 2.2402,4.0234 A 7,7 0 0 0 1,8 7,7 0 0 0 6.3828,14.803 L 8.6172,10.934 A 3,3 0 0 1 8,11 3,3 0 0 1 5.4062,9.498 L 5.4023,9.5 Z M 10.23,6 A 3,3 0 0 1 11,8 3,3 0 0 1 10.596,9.49 L 10.598,9.5 7.4395,14.973 A 7,7 0 0 0 8,15 7,7 0 0 0 15,8 7,7 0 0 0 14.701,6 Z"/>
                                                        <path fill="#93bbff" d="M 8,6 A 2,2 0 0 0 6,8 2,2 0 0 0 8,10 2,2 0 0 0 10,8 2,2 0 0 0 8,6 Z"/>
                                                    </svg>
                                                    <a href="/extension/">浏览器插件</a>
                                                </li>
                                                <li>
                                                    <img src="/img/v3/[email protected]" alt="ip代理问题">
                                                    <div>
                                                        <a href="/doc/faq/buy/">常见问题</a>
                                                        <div class="menu-twoLevel">
                                                            <a href="/doc/faq/buy/" class="gray-link">购买问题</a>
                                                            <a href="/doc/faq/product/" class="gray-link">产品问题</a>
                                                            <a href="/doc/func/identity_faq_common/" class="gray-link">实名问题</a>
                                                            <a href="/doc/faq/invoice/" class="gray-link">发票问题</a>
                                                        </div>
                                                    </div>
                                                </li>
                                                <li>
                                                    <img src="/img/v3/[email protected]">
                                                    <a href="/changelog/">更新日志💡</a>
                                                </li>
                                            </ul>
                                        </div>
                                    </div>

                                </div>
                            </div>
                        </div>
                        <!-- 开发指南 -->
                        <div class="menu-group">
                            <div class="menu-type">
                                <div class="menu-type--desc">
                                    <div class="subtitle">开发指南</div>
                                    <div class="underline media-pc"></div>
                                    <div class="flex">
                                        <div class="menu-oneLevel">
                                            <ul>
                                                <li>
                                                    <img src="/img/v3/[email protected]">
                                                    <a href="/doc/dev/quickstart/">快速入门</a>
                                                </li>
                                                <li>
                                                    <img src="/img/v3/[email protected]">
                                                    <div>
                                                        <a href="/doc/dev/quickstart/" >开发手册</a>
                                                        <div class="menu-twoLevel">
                                                            <a href="/doc/dev/tps/" class="gray-link">隧道开发</a>
                                                            <a href="/doc/dev/dps/" class="gray-link">私密开发</a>
                                                            <a href="/doc/dev/kps/" class="gray-link">独享开发</a>
                                                            
                                                        </div>
                                                    </div>
                                                </li>
                                                <li>
                                                    <img src="/img/v3/[email protected]">
                                                    <a href="/doc/dev/api/">API获取代理</a>
                                                </li>
                                                <li>
                                                    <img src="/img/v3/[email protected]">
                                                    <a href="/doc/dev/proxy/">代理可用性测试</a>
                                                </li>
                                                <li>
                                                    <img src="/img/v3/[email protected]">
                                                    <a href="/doc/dev/errcode/">错误码一览</a>
                                                </li>
                                            </ul>
                                        </div>
                                        <div class="menu-oneLevel">
                                            <ul>
                                                <li>
                                                    <img src="/img/v3/[email protected]">
                                                    <div>
                                                        <a href="/doc/api/#3-api" >所有API列表</a>
                                                        <div class="menu-twoLevel">
                                                            <a href="/doc/api/#3-api" class="gray-link">账号接口</a>
                                                            <a href="/doc/api/#3-api" class="gray-link">订单接口</a>
                                                            <a href="/doc/api/#3-api" class="gray-link">产品接口</a>
                                                            <a href="/doc/api/#3-api" class="gray-link">工具接口</a>
                                                        </div>
                                                    </div>
                                                </li>
                                                <li>
                                                    <img src="/img/v3/[email protected]">
                                                    <a href="/doc/api/auth/">API授权与验证</a>
                                                </li>
                                                <li>
                                                    <img src="/img/v3/[email protected]">
                                                    <a href="/doc/dev/sdk/">SDK&代码样例</a>
                                                </li>
                                                <li>
                                                    <img src="/img/v3/[email protected]">
                                                    <a href="/tool/fetchua/">在线获取UA</a>
                                                </li>
                                                <li>
                                                    <img src="/img/v3/header__hc_kgtools.png">
                                                    <a href="https://kgtools.cn">爬虫工具库</a>
                                                </li>
                                            </ul>
                                        </div>
                                    </div>

                                </div>
                            </div>
                        </div>

                    </div>
                    <div class="menu-contact">
                        <div class="flex">
                            <div class="flex flex-column tel">
                                <div class="icon_text">
                                    <img src="/img/v3/[email protected]" width="16" height="16">
                                    <div>客服热线:</div>
                                </div>
                                <span class="s-text">400-058-0638</span>
                            </div>
                            <a href="http://q.url.cn/CDksXo?_type=wpa&amp;qidian=true" class="flex flex-column flex-auto qq" target="_blank">
                                <div class="icon_text">
                                    <img class="init" src="/img/v3/[email protected]" width="16" height="16">
                                    <img class="hover" src="/img/v3/[email protected]" width="16" height="16">
                                    <div>客服QQ:</div>
                                </div>
                                <span class="s-text">800849628</span>
                            </a>
                        </div>
                        <a class="btn btn-small blue-btn btn-cross online-chat" href="javascript:void(0);">售前在线咨询</a>
                        <span class="s-text">&nbsp;&nbsp;( 周一至周五 9:00 ~ 21:00 )&nbsp;&nbsp;</span>
                        <a class="btn btn-small blue-btn btn-cross is-plain" href="/support/addrequest">工单支持</a>
                        <div class="wechat">
                            <img src="/img/service_wx3.png" width="138">
                            <p>客服微信</p>
                        </div>

                    </div>
                </div>
            </div>
        </div>
    </div>
    <div class="item-area nav-safe">
        <a id="menu_safe" class="item link " href="/safe/">
            <div class="item-wrap">安全合规</div>
        </a>
    </div>
    <div class="item-area nav-cps">
        <a id="menu_cps" class="item link " href="/cps/">
            <div class="item-wrap">加入推广</div>
        </a>
    </div>
    <div class="item-area nav-free">
        <a id="menu_free" class="item link " href="/blog/">
            <div class="item-wrap">博客</div>
        </a>
    </div>
</div>
<div class="action a_hover_sec">
    <div class="media-pc unlogin">
        <a href="/usercenter/" class="btn btn-small btn-link">会员中心</a>
        <a href="/login/" class="btn btn-small btn-link">登录</a>
    </div>
    <div class="media-pc welcome-link" style="display: none">
        <a id="header_login_btn" href="/usercenter/overview" class="btn btn-small btn-link"><span class="welcome"></span></a>
    </div>
    <div class="media-pc header-register">
        <a id="header_regist_btn" href="/regist/" class="btn btn-small blue-btn link-mc"><span>免费注册</span></a>
        <span id="noti"></span>
    </div>
    <div class="media-m">
        <div class="login-anonym">
            <a href="/login/" class="btn btn-small">登录</a>
            <a href="/regist/" class="btn btn-small blue-btn">注册</a>
        </div>
        <div class="login-user">
            <a href="/usercenter/" class="btn btn-small blue-btn">会员中心</a>
        </div>
    </div>

    <div id="top_account_navpop" class="nav-pop-v3">
        <div class="userInfo">
            <div class="account flex">
                
                    
                        <b>...</b>
                        <span class="tag tag-lightgray">未实名</span>
                    
                
            </div>
            <div class="balance">可用余额<a href="/usercenter/overview"><b class="warning">¥<span id="top_account_balance"></span></b></a>
            </div>
            <div class="flex order">
                <a id="v3_top_account_ordercount" class="gray-link flex flex-column" href="/usercenter/orderlist">
                    <b></b>
                    <span>我的订单</span>
                </a>
                <a id="v3_top_account_coupon" class="gray-link flex flex-column" href="/usercenter/coupon/">
                    <b>0</b>
                    <span>优惠券<i class=""></i></span>
                </a>
                <a id="v3_top_account_unread_count" class="gray-link flex flex-column" href="/usercenter/message/">
                    <b></b>
                    <span>未读消息<i class=""></i></span>
                </a>
            </div>
            <div class="flex setting">
                <a href="/usercenter/recharge" class="btn btn-small warning-btn flex-auto icon icon1">账户充值</a>
                <a href="/usercenter/orderlist/" class="btn btn-small blue-btn flex-auto icon icon1">订单管理</a>
            </div>
            <div class="user-action">
                <div class="flex">
                    <a class="icon_text gray-link flex-column" href="/usercenter/payhistory">
                        <img src="/img/v3/[email protected]" width="20" height="20">
                        <span>我的账单</span>
                    </a>
                    <a class="icon_text gray-link flex-column flex-auto" href="/usercenter/autorenew">
                        <img src="/img/v3/[email protected]" width="20" height="20">
                        <span>自动续费</span>
                    </a>
                    <a class="icon_text gray-link flex-column" href="/usercenter/invoice">
                        <img src="/img/v3/[email protected]" width="20" height="20">
                        <span>发票申请</span>
                    </a>
                </div>
                <div class="flex">
                    <a class="icon_text gray-link flex-column" href="/usercenter/invoicelist">
                        <img src="/img/v3/[email protected]" width="20" height="20">
                        <span>开票记录</span>
                    </a>
                    <a class="icon_text gray-link flex-column flex-auto" href="/logout">
                        <img src="/img/v3/[email protected]" width="20" height="20">
                        <span>退出登录</span>
                    </a>
                    <a class="icon_text gray-link flex-column" href="javascript: void(0)"></a>
                </div>
            </div>
        </div>
    </div>
</div>'''

print(etree.HTML(html))

2skhul33


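The emoji problem (your HTML contains one: the 💡 in the "更新日志💡" link) is a known lxml issue that has been open for years. For now you can work around it.
Summarizing what I found (credit goes to the original authors online), the first solution uses soupparser, which may be inefficient for large HTML trees: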

# solution 1 (soupparser needs the beautifulsoup4 package installed)
from lxml.html import soupparser

html = "<html><body>💡</body></html>"
dom = soupparser.fromstring(html)

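Alternatively, as suggested in another linked issue, you can encode the HTML string as ASCII, replacing the unsupported "entities" with their XML character references.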

# solution 2
from lxml import etree

html = "<html><body>💡</body></html>"
dom = etree.HTML(html.encode("ascii", "xmlcharrefreplace").decode("ascii"))
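
To see what the encoding step in solution 2 actually does, here is a small standard-library sketch. The `snippet` string is only an illustrative sample taken from the question's HTML, not part of the linked issue. Note that `xmlcharrefreplace` turns every non-ASCII character into a numeric character reference, the Chinese text included, and the references decode back to the original characters:

```python
from html import unescape

# Illustrative sample; any string with non-ASCII characters would do.
snippet = "<a>更新日志💡</a>"

# Every character that ASCII cannot represent becomes "&#<code point>;".
escaped = snippet.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)

# Decoding the references restores the original text, so nothing is lost.
restored = unescape(escaped)
print(restored == snippet)
```

The escaped string is longer than the original, which is the price for handing the parser pure ASCII.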


Of course, you could also just strip the emoji and be done with it. That is up to you; in that case the `emoji` package may help you find them all.
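
If you would rather not add a dependency, a minimal sketch along these lines strips (or escapes) only the suspect characters with the standard `re` module. It assumes the failure is triggered only by characters outside the Basic Multilingual Plane (code points above U+FFFF, which covers 💡 and many other emoji). That assumption and the helper names `strip_non_bmp` and `escape_non_bmp` are mine, not something taken from the lxml documentation:

```python
import re

# Assumption: only code points above U+FFFF cause the parse failure.
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def strip_non_bmp(text: str) -> str:
    """Remove every character outside the Basic Multilingual Plane."""
    return NON_BMP.sub("", text)


def escape_non_bmp(text: str) -> str:
    """Replace those characters with numeric character references instead."""
    return NON_BMP.sub(lambda m: f"&#{ord(m.group())};", text)


sample = "<a>更新日志💡</a>"
print(strip_non_bmp(sample))
print(escape_non_bmp(sample))
```

Either result can then be passed to `etree.HTML(...)`. Unlike solution 2, this leaves the Chinese text as it is; it will not catch emoji that live inside the BMP, so the `emoji` package remains the more thorough option if you need to remove every emoji.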
