The task is to find the longest run of consecutive repeats of a group
For instance, given the DNA sequence "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC", there are 7 occurrences of AGATC, and the pattern (AGATC) matches every one of them. Is it possible to write a regular expression that captures only the longest run of consecutive repeats, i.e. AGATCAGATCAGATCAGATCAGATC in the given text? If that is not possible with a regex alone, how can I iterate through each run in Python (i.e. the 1st run is AGATCAGATC, the 2nd is AGATCAGATCAGATCAGATCAGATC, and so on)?
Recommended answer. Use:

    import re

    sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
    # Each match is one maximal run of consecutive AGATC repeats
    matches = re.findall(r'(?:AGATC)+', sequence)
    # To find the longest run
    longest = max(matches, key=len)

Explanation:
Non-capturing group (?:AGATC)+
- + quantifier: matches the preceding group between one and unlimited times, as many times as possible (greedy).
- AGATC matches the characters AGATC literally (case sensitive)
Result:

    # print(matches)
    ['AGATCAGATC', 'AGATCAGATCAGATCAGATCAGATC']
    # print(longest)
    'AGATCAGATCAGATCAGATCAGATC'
You can test the regex here.
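The question also asks how to iterate through each run. A minimal sketch of one way to do that, assuming `re.finditer` with the same pattern as above; the helper name `iter_runs` and the `unit` variable are hypothetical, introduced only for this illustration. It yields the start offset, the repeat count, and the matched text of every run, so the longest one can be picked by repeat count:

```python
import re

sequence = "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC"
unit = "AGATC"  # the repeated group; hypothetical variable name


def iter_runs(text, unit):
    """Yield (start, repeat_count, run) for each maximal run of `unit`."""
    # re.escape keeps the unit literal even if it contains regex metacharacters
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    for match in pattern.finditer(text):
        run = match.group()
        yield match.start(), len(run) // len(unit), run


runs = list(iter_runs(sequence, unit))
for start, count, run in runs:
    print(f"offset {start}: {count} repeats -> {run}")

# Longest run, chosen by repeat count
longest_start, longest_count, longest_run = max(runs, key=lambda r: r[1])
```

Because `+` is greedy and matches do not overlap, each yielded run is maximal, so comparing repeat counts here is equivalent to comparing lengths with `max(matches, key=len)`.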