Python: replace unprintable characters except line breaks

Posted by 62o28rlo 7 months ago in Python


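I am trying to write a function that replaces unprintable characters with a space. That works, but it also replaces the line break \n with a space, and I can't figure out why.
Test code: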

import re

def replace_unknown_characters_with_space(input_string):
    # Replace non-printable characters (including escape sequences) with spaces
    # According to ChatGPT, \n should not be in this range
    cleaned_string = re.sub(r'[^\x20-\x7E]', ' ', input_string)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\x0Ais\x2028a\x2029test."
    
    print("Original String:")
    print(test_string)
    
    cleaned_string = replace_unknown_characters_with_space(test_string)
    
    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()

Output:

Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.

Cleaned String:
This is a test string with some unprintable characters: Hello World This is 28a 29test.

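As you can see, the line break before Hello World was replaced with a space, which is not intended. I tried to get help from ChatGPT, but its regex solutions did not work.
My last resort would be a for loop that filters characters with Python's built-in isprintable() method, but that would be much slower than a regex.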
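The fallback described above can be sketched in a few lines, which also makes it easy to measure before ruling it out on speed. This is only an illustration: the function name `replace_with_isprintable` is made up here. Note that `str.isprintable()` follows Unicode rules, so non-ASCII letters such as `ñ` are kept, unlike with the ASCII-only regex.

```python
def replace_with_isprintable(text: str) -> str:
    """Replace every non-printable character except a newline with a space."""
    return "".join(
        ch if ch == "\n" or ch.isprintable() else " "
        for ch in text
    )


sample = "Hello\x85World\rThis\nis\u2028a test."
print(repr(replace_with_isprintable(sample)))
# prints 'Hello World This\nis a test.'
```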

python-3.x

Source: https://stackoverflow.com/questions/77382208/python-replace-unprintable-characters-except-linebreak

4 Answers


8ftvxx2r1#

You don't need re for this; built-in functionality is enough.
Just build a translation table and use str.translate().
For example:

TEST_STRING = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\x0Ais\x2028a\x2029test."
TDICT = {c: " " for c in range(32) if c != 10}
print(TEST_STRING.translate(TDICT))

Output:

This is a test string with some unprintable characters:
Hello
World This
is 28a 29test.


Note: once you have identified the right regular expression, re is much faster than str.translate.
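The relative speed of the two approaches depends on the input and the Python version, so it may be worth measuring rather than assuming. The sketch below is one way to do that; the names `with_regex`, `with_translate` and `compare` are hypothetical, and the regex is written to cover the same code points as the translation table (0 to 31, except 10).

```python
import re
import timeit

# Same character set as the table: code points 0-31 except 10 ('\n').
CONTROL_RE = re.compile(r"[\x00-\x09\x0b-\x1f]")
TABLE = {c: " " for c in range(32) if c != 10}


def with_regex(text: str) -> str:
    return CONTROL_RE.sub(" ", text)


def with_translate(text: str) -> str:
    return text.translate(TABLE)


def compare(text: str, number: int = 100) -> dict:
    """Return the total seconds each approach takes for `number` runs."""
    return {
        fn.__name__: timeit.timeit(lambda: fn(text), number=number)
        for fn in (with_regex, with_translate)
    }


sample = "Hello\x00World\rThis\nis a test.\t" * 1000
# Both approaches must agree before their timings are worth comparing.
assert with_regex(sample) == with_translate(sample)
for name, seconds in compare(sample).items():
    print(f"{name}: {seconds:.4f}s")
```

No timings are quoted here on purpose: run it on data that resembles your own.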

Answered 7 months ago

2eafrhcq2#

A modified regex, inspired by Carlo Arenas's answer.
Code:

import re

def replace_unknown_characters_with_space(input_string):
    # Replace all non printable ascii, excluding \n from the expression
    cleaned_string = re.sub(r'[^\x20-\x7E\n]', ' ', input_string, flags=re.DOTALL)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\nis\x2028a\x2029test."
    
    print("Original String:")
    print(test_string)
    
    cleaned_string = replace_unknown_characters_with_space(test_string)
    
    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()

Output:

Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.

Cleaned String:
This is a test string with some unprintable characters:
Hello World This
is 28a 29test.

\n is no longer replaced.

Answered 7 months ago

dced5bon3#

This question seems to have several parts, so let's address them separately.

Why is '\n' affected?

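'\n' is not special inside a character class. It is affected because its code point, 0x0A, lies outside the range \x20-\x7E, so the negated class [^\x20-\x7E] matches it like any other control character.
As you found, the fix is to list '\n' inside the negated class so that it is excluded from the match. The re.DOTALL flag is not needed for this: it only changes whether '.' matches a newline and has no effect on character classes.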
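A quick check of how the negated class and re.DOTALL treat '\n' can settle this point. The snippet is a minimal sketch using only the pattern from the question; the variable names are arbitrary.

```python
import re

pattern = r"[^\x20-\x7E]"

# '\n' is code point 10 (0x0A), which is below the start of the range 0x20-0x7E.
print(ord("\n"), hex(ord("\n")))  # prints: 10 0xa

# The negated class matches '\n' with or without re.DOTALL,
# because the flag only changes what '.' matches.
without_flag = re.sub(pattern, " ", "a\nb")
with_flag = re.sub(pattern, " ", "a\nb", flags=re.DOTALL)
print(repr(without_flag), repr(with_flag))  # prints: 'a b' 'a b'

# Listing '\n' inside the negated class is what keeps it.
kept = re.sub(r"[^\x20-\x7E\n]", " ", "a\nb")
print(repr(kept))  # prints: 'a\nb'
```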

What is "printable"?

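"Printable" has a broader meaning than what ChatGPT suggested, which looks like a translation of the POSIX ASCII class [:print:] (I would have recommended [:graph:] instead). Judging by your test, you are probably more interested in removing "funny characters" that would affect the output when printed.
Your test includes characters, probably mistranslated by ChatGPT, that were meant to be Unicode whitespace. \s is a better way to match those in a regex. Also, Python does not use \x for code points above 255: \x takes exactly two hex digits, and larger code points use \u notation instead, so those escapes likely came from PCRE syntax.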
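The difference between the two escape forms is easy to see in the interpreter. This small sketch uses a fragment of the question's test string; the variable names are arbitrary.

```python
import re

as_written = "is\x2028a"  # '\x20' (a space) followed by the digits '2' and '8'
intended = "is\u2028a"    # a single LINE SEPARATOR character, U+2028

print(len(as_written), repr(as_written))  # prints: 6 'is 28a'
print(len(intended), repr(intended))      # prints: 4 'is\u2028a'

# For str patterns, \s matches Unicode whitespace, including U+2028.
print(re.sub(r"\s", " ", intended))       # prints: is a
```

This is why the question's output shows "is 28a 29test." even before any cleaning.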
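Since you are including Unicode whitespace, and Python strings are sequences of Unicode code points, it seems logical to also filter out "funny Unicode characters" (such as the BIDI control characters), which would have an effect similar to '\r' if you intend to print the string later.
If you consider any non-ASCII character "funny", the solution needs to change as well.
The following version of your example (with corrected test text and some extensions) could be considered "correct", but I suspect further changes will be needed as you refine your requirements.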

import re

def replace_unknown_characters_with_space(input_string):
    # Replace non-printable ASCII characters (including escape sequences) with spaces
    cleaned_string = re.sub(r'[^][\w\n!"\#$%&\'()*+,./:;<=>?@\\\^_`{|}~-]', ' ', input_string, flags=re.DOTALL)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0AThis\x0Dis\u2028a\u2029test. including some punctuation like `({~})' and even \\, and \" + words like <año> or numbers like \u1bb1\nText cant be \033[1m[bold]\033[0m or go \u2067backwards\u2069, but can also contain wide numbers like \uff11 or ０"

    print("Original String:")
    print(test_string)

    cleaned_string = replace_unknown_characters_with_space(test_string)

    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()
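If "funny characters" is taken to mean Unicode control, format and line or paragraph separator characters, one alternative to a hand-written character class is to filter by general category with the standard `unicodedata` module. This is a sketch under that assumption, not the approach used in the code above; the name `replace_by_category` and the chosen category set are illustrative.

```python
import unicodedata

# Cc: control characters (e.g. '\r', ESC), Cf: format characters (e.g. BIDI
# isolates), Zl / Zp: line and paragraph separators (U+2028, U+2029).
UNWANTED_CATEGORIES = {"Cc", "Cf", "Zl", "Zp"}


def replace_by_category(text: str) -> str:
    """Replace characters in UNWANTED_CATEGORIES with a space, keeping newlines."""
    return "".join(
        " " if ch != "\n" and unicodedata.category(ch) in UNWANTED_CATEGORIES else ch
        for ch in text
    )


sample = "go \u2067backwards\u2069\r\nnext\u2028line a\u00f1o"
print(repr(replace_by_category(sample)))
# prints 'go  backwards  \nnext line año'
```

Letters, digits and punctuation from any script pass through unchanged, so this is stricter about control characters and more permissive about non-ASCII text than an ASCII-only class.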


Answered 7 months ago

d8tt03nd4#

Instead, skip \x0A:

import re

def replace_unknown_characters_with_space(input_string):
    # Replace ASCII control characters with spaces, skipping \n (\x0A).
    # The class is not negated, and the second range starts at \x0B,
    # the code point right after \x0A.
    cleaned_string = re.sub(r'[\x00-\x09\x0B-\x1F]', ' ', input_string)

    return cleaned_string


Answered 7 months ago
