Python: replace non-printable characters except the line break

Posted by 62o28rlo 7 months ago in Python

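I am trying to write a function that replaces non-printable characters with spaces. That part works, but it also replaces the line break \n with a space, and I don't know why.
Test code: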

import re

def replace_unknown_characters_with_space(input_string):
    # Replace non-printable characters (including escape sequences) with spaces
    # According to ChatGPT, \n should not be in this range
    cleaned_string = re.sub(r'[^\x20-\x7E]', ' ', input_string)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\x0Ais\x2028a\x2029test."
    
    print("Original String:")
    print(test_string)
    
    cleaned_string = replace_unknown_characters_with_space(test_string)
    
    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()

Output:

Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.

Cleaned String:
This is a test string with some unprintable characters: Hello World This is 28a 29test.


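As you can see, the line break before Hello World is replaced by a space, which is not intended. I tried to get help from ChatGPT, but its regex solutions do not work.
My last resort would be a for loop that filters characters with Python's built-in isprintable() method, but that would be much slower than a regex.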
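
For comparison, the loop-based fallback described above could be sketched as follows. The function name and the `keep` parameter are illustrative assumptions, not part of the original post. Note that `str.isprintable()` returns False for "\n", so the line break has to be kept explicitly.

```python
def replace_nonprintable(text: str, keep: str = "\n") -> str:
    """Replace every character that is not printable with a space.

    Characters listed in `keep` (hypothetical parameter) are left untouched,
    because str.isprintable() treats "\n" as non-printable.
    """
    return "".join(
        ch if ch.isprintable() or ch in keep else " "
        for ch in text
    )


if __name__ == "__main__":
    print(replace_nonprintable("Hello\x85World\rThis\nis\u2028a\u2029test."))
```

Unlike the ASCII-only regex, this keeps printable non-ASCII letters such as "ñ", which may or may not be what is wanted.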

8ftvxx2r1#

You don't need re; built-in functionality is enough.
Just build a translation table and then use str.translate().
For example:

TEST_STRING = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\x0Ais\x2028a\x2029test."
TDICT = {c: " " for c in range(32) if c != 10}
print(TEST_STRING.translate(TDICT))

Output:

This is a test string with some unprintable characters:
Hello
World This
is 28a 29test.

Note:

Once you have identified the correct regular expression, re is much faster than str.translate.
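
The table above only covers code points 0 to 31, so characters such as \x85 pass through unchanged. A minimal sketch of a wider table follows, assuming you also want to blank DEL, the C1 controls and the Unicode line and paragraph separators. The names `CONTROL_CODEPOINTS`, `TABLE` and `clean` are hypothetical.

```python
# Hypothetical extension of TDICT: C0 controls, DEL, C1 controls and the
# Unicode line/paragraph separators are mapped to a space.
CONTROL_CODEPOINTS = [
    *range(0x00, 0x20),   # C0 controls
    0x7F,                 # DEL
    *range(0x80, 0xA0),   # C1 controls (includes \x85, NEL)
    0x2028,               # LINE SEPARATOR
    0x2029,               # PARAGRAPH SEPARATOR
]

# "\n" (0x0A) is excluded so that line breaks survive.
TABLE = {cp: " " for cp in CONTROL_CODEPOINTS if cp != 0x0A}


def clean(text: str) -> str:
    """Apply the translation table to the whole string."""
    return text.translate(TABLE)


if __name__ == "__main__":
    print(clean("Hello\x85World\rThis\nis\u2028a\u2029test."))
```

Which code points belong in the table depends on what you count as "non-printable"; the list here is one possible choice.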

2eafrhcq2#

A modified regular expression, inspired by Carlo Arenas's answer.
Code:

import re

def replace_unknown_characters_with_space(input_string):
    # Replace all non printable ascii, excluding \n from the expression
    cleaned_string = re.sub(r'[^\x20-\x7E\n]', ' ', input_string, flags=re.DOTALL)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\nis\x2028a\x2029test."
    
    print("Original String:")
    print(test_string)
    
    cleaned_string = replace_unknown_characters_with_space(test_string)
    
    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()

Output:

Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.

Cleaned String:
This is a test string with some unprintable characters:
Hello World This
is 28a 29test.


\n is no longer replaced.
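
The pattern above hard-codes \n as the only exception. If other characters (a tab, say) should be kept too, one possible sketch is to build the class from a parameter and compile it once. `make_cleaner` and its `keep` argument are hypothetical names used for illustration.

```python
import re


def make_cleaner(keep: str = "\n"):
    """Return a function that blanks everything outside printable ASCII,
    except the characters listed in `keep` (hypothetical parameter)."""
    # re.escape makes characters such as "]" or "-" safe inside the class.
    pattern = re.compile(r"[^\x20-\x7E" + re.escape(keep) + r"]")

    def cleaner(text: str) -> str:
        return pattern.sub(" ", text)

    return cleaner


if __name__ == "__main__":
    keep_newline_and_tab = make_cleaner("\n\t")
    print(keep_newline_and_tab("col1\tcol2\nrow\x85end\r"))
```

Compiling the pattern once avoids rebuilding it on every call when many strings are cleaned.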

dced5bon3#

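This question seems to have several parts, so let's address them one at a time.

Why is '\n' affected?

'\n' is the character 0x0A, which lies outside the range \x20-\x7E, so the negated class [^\x20-\x7E] matches it like any other control character. Regular expressions were designed to operate on lines and do treat '\n' specially in some places (it marks the end of a line for ., ^ and $), but not inside a character class.
As you found, you can keep '\n' by listing it inside the negated class. The re.DOTALL flag only changes whether . matches '\n', so it is harmless here but has no effect on this pattern.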
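
A quick check of how the original pattern treats '\n', and of whether re.DOTALL changes anything for a character class, might look like this. The variable names are illustrative only.

```python
import re

PATTERN = r"[^\x20-\x7E]"

# "\n" is code point 0x0A, which is below 0x20, so the negated class matches it.
newline_matches = re.fullmatch(PATTERN, "\n") is not None

sample = "Hello\nWorld\r!"
plain = re.sub(PATTERN, " ", sample)
dotall = re.sub(PATTERN, " ", sample, flags=re.DOTALL)

if __name__ == "__main__":
    print(hex(ord("\n")), newline_matches, plain == dotall)
```

Both substitutions give the same result because the pattern contains no `.`, which is the only construct re.DOTALL affects.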

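What is "printable"?

"Printable" has a broader meaning than what ChatGPT suggested, which looks like a translation of the POSIX ASCII class [:print:] (I would recommend [:graph:]). Judging from your test, you are probably more interested in removing "funny characters" that affect the output when printed.
Your test includes characters, probably mistranslated by ChatGPT, that were meant to be Unicode whitespace. \s is a better way to identify those in a regex. Python's \x escape takes exactly two hex digits, so code points above 255 are written with \u and four hex digits instead; the \x2028 form probably came from PCRE-style syntax.
Since you included Unicode whitespace and Python strings are Unicode, it seems logical to also filter out "funny Unicode characters" such as the BIDI control class, which would have an effect similar to '\r' if you intend to print the string later.
If you consider any non-ASCII character "funny", the solution needs to change as well.
The following version of your example (with corrected test text and some extensions) could be considered "correct", but I suspect further changes will be needed as you refine your requirements.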

import re

def replace_unknown_characters_with_space(input_string):
    # Replace non-printable ASCII characters (including escape sequences) with spaces
    cleaned_string = re.sub(r'[^][\w\n!"\#$%&\'()*+,./:;<=>?@\\\^_`{|}~-]', ' ', input_string, flags=re.DOTALL)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0AThis\x0Dis\u2028a\u2029test. including some punctuation like `({~})' and even \\, and \" + words like <año> or numbers like \u1bb1\nText cant be \033[1m[bold]\033[0m or go \u2067backwards\u2069, but can also contain wide numbers like \uff11 or 0"

    print("Original String:")
    print(test_string)

    cleaned_string = replace_unknown_characters_with_space(test_string)

    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()
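
If the goal is to blank control-like characters across all of Unicode rather than to enumerate allowed ASCII punctuation, an alternative sketch (a different technique from the regex above) is to filter on Unicode general categories with the standard unicodedata module. The function name, the `keep` parameter and the chosen category set are assumptions.

```python
import unicodedata

# Assumed set of "funny" categories: Cc (control), Cf (format, which includes
# the bidi controls), Zl (line separator) and Zp (paragraph separator).
CONTROL_LIKE = {"Cc", "Cf", "Zl", "Zp"}


def replace_control_like(text: str, keep: str = "\n") -> str:
    """Replace control-like characters with a space, except those in `keep`."""
    return "".join(
        " " if ch not in keep and unicodedata.category(ch) in CONTROL_LIKE else ch
        for ch in text
    )


if __name__ == "__main__":
    print(replace_control_like("go \u2067backwards\u2069 and \033[1m[bold]\033[0m\n<a\u00f1o>"))
```

Letters and digits outside ASCII, such as "ñ" or the fullwidth "\uff11", are left alone by this approach.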

d8tt03nd4#

Instead, skip \x0A by leaving it out of the class of characters to replace:

import re

def replace_unknown_characters_with_space(input_string):
    # Replace ASCII control characters with spaces, except \n (\x0A).
    # The class is not negated, and the second range starts at \x0B
    # (decimal 11), so \x0A falls in the gap and is left alone.
    cleaned_string = re.sub(r'[\x00-\x09\x0B-\x1F]', ' ', input_string)

    return cleaned_string
