python - Handling complex parentheses structures to get the expected data

We have data from a REST API call stored in an output file that looks as follows:

Sample Input File:

test test123 - test (bla bla1 (On chutti))
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))

Expected Output:

bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor, Milkha Singh

Conditions to Derive the Expected Output:

Always consider the last occurrence of parentheses () in each line: we need to extract the values within this last, outermost pair of parentheses.
Inside that last pair, extract every value that appears before a nested pair of parentheses ().
Example: in test test123 - test (bla bla1 (On chutti)), the last outermost pair runs from (bla to chutti)), so I need bla bla1, since it comes before the inner (On chutti). In other words, find the last outermost pair and, for every nested pair inside it, take the text that precedes it. In the line testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan)), the values needed are Ranbir kapoor and Milkha Singh.

Attempted Regex: I tried using the following regular expression on Working Demo of regex:

Regex:

^(?:^[^(]+\([^)]+\) \(([^(]+)\([^)]+\)\))|[^(]+\(([^(]+)\([^)]+\),\s([^\(]+)\([^)]+\)\s\([^\)]+\)\)|(?:(?:.*?)\((.*?)\(.*?\)\))|(?:[^(]+\(([^)]+)\))$

The regex I tried works fine, but I would like to improve it with the advice of experts here.

Preferred Languages: I am looking to improve this regex, but a Python or awk answer is also fine. I will also try to add an awk answer myself.

Asked Nov 16, 2024 at 11:20 by RavinderSingh13; edited Nov 20, 2024 at 9:57 by Arvind Kumar Avinash.

9 Answers

Any time you're considering using a lengthy and/or complicated regexp to try to solve a problem, keep in mind the quote:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Using any awk:

$ cat tst.awk
{
    rec = $0
    while ( match(rec, /\([^()]*)/) ) {
        tgt = substr($0,RSTART+1,RLENGTH-2)
        rec = substr(rec,1,RSTART-1) RS substr(rec,RSTART+1,RLENGTH-2) RS substr(rec,RSTART+RLENGTH)
    }
    gsub(/ *\([^()]*) */, "", tgt)
    print tgt
}

$ awk -f tst.awk file
bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor, Milkha Singh

I save a copy of $0 in rec. In the loop, I convert every innermost (foo) inside rec to \nfoo\n (assuming the default RS, which cannot be present in an RS-separated record and is therefore a safe placeholder) and save the corresponding substring of $0, which still contains any original nested ( and ) pairs, in the variable tgt. When the loop ends, tgt contains the contents of the last outermost (...) pair in this input record, e.g. Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan). The final gsub() then removes all (...) substrings from tgt, including any surrounding blanks, leaving just the desired output.

If you can ever have more levels of parenthesised strings remaining in tgt than just 1 level deep, just change gsub(/ *\([^()]*) */, "", tgt) to while ( gsub(/ *\([^()]*) */, "", tgt) );.
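
For readers who prefer Python, the same two-phase idea can be sketched roughly as follows. This is a port under the same assumptions as the awk script (balanced parentheses, no newline inside a line); the name last_group is made up for illustration and is not part of the original answer.

```python
import re

INNERMOST = re.compile(r"\([^()]*\)")     # a (...) pair with no parens inside
NESTED = re.compile(r" *\([^()]*\) *")    # a (...) pair plus surrounding blanks


def last_group(line):
    """Contents of the last outermost (...), with nested groups removed."""
    line = line.rstrip("\n")
    rec = line                             # working copy, like rec in the awk script
    tgt = ""
    while True:
        m = INNERMOST.search(rec)
        if m is None:
            break
        # rec and line always have equal length, so offsets found in rec
        # point at the same characters in the untouched line
        tgt = line[m.start() + 1:m.end() - 1]
        # mask this pair's parens with newlines so the enclosing pair can match next
        rec = (rec[:m.start()] + "\n" + rec[m.start() + 1:m.end() - 1]
               + "\n" + rec[m.end():])
    # repeat the removal until nothing changes, to cope with deeper nesting
    while True:
        tgt, count = NESTED.subn("", tgt)
        if count == 0:
            break
    return tgt
```

Calling last_group on each line of the sample input should reproduce the five lines of expected output; a line without parentheses yields an empty string.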

Purely based on your shown input and your comments reflecting that you need to capture 1 or 2 values per line, here is an optimized regex solution:

^(?:\([^)(]*\)|[^()])*\(([^)(]+)(?:\([^)(]*\)[, ]*(?:([^)(]+))?)?

RegEx Demo

RegEx Details:

This regex solution does the following:

match everything before the last (...), then match (, then
1st group: match name that must not have ( and ) then
optional match of (...) or comma/space then
2nd group: match name that must not have ( and )

Further Details:

^: Start
(?:: Start non-capture group
- \([^\n)(]*\): Match any pair of (...) text
- |: OR
- [^()\n]: Match any character that is not (, ) or \n
)*: End non-capture group. Repeat this 0 or more times
\(: Match last (
([^)(\n]+): 1st capture group that matches text with 1+ characters that are not (, ) and \n
(?:: Start non-capture group 1
- \([^\n)(]*\): Match any pair of (...) text
- [, ]*: Match 0 or more of space or comma characters
- (?:: Start non-capture group 2
  - ([^)(\n]+): 2nd capture group that matches text with 1+ characters that are not (, ) and \n
- )?: End non-capture group 2. ? makes this an optional match
)?: End non-capture group 1. ? makes this an optional match
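
As an illustration only, the regex above can be tried with Python's standard re module along these lines. The helper name extract_names is hypothetical, and the sketch inherits the stated limitation of at most two captured values per line; the captured groups keep their trailing blank, so they are stripped here.

```python
import re

PATTERN = re.compile(
    r"^(?:\([^)(]*\)|[^()])*"      # everything before the last outermost (
    r"\(([^)(]+)"                  # group 1: first name
    r"(?:\([^)(]*\)[, ]*"          # optional nested (...) and separator
    r"(?:([^)(]+))?)?"             # group 2: optional second name
)


def extract_names(line):
    """Return the one or two names captured by the regex, stripped of blanks."""
    m = PATTERN.match(line)
    if m is None:
        return []
    # groups that did not take part in the match are None
    return [g.strip() for g in m.groups() if g]
```

For example, extract_names("balle balle (testagain) (Rohit Sharma)") gives a one-element list, while the last sample line gives both names.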

Assumptions/understandings:

parens ((, )) exist solely as delimiters (ie, they do not show up as part of the data)
every ( has a matching )
level=1 consists of all text not in parens
each successive ( takes us down one level
each successive ) takes us up one level
for level N (N>1) we are to display only the last matching set of data (eg, if there are 2 distinct sets of level=2 data then we only display the last data set)
the textual description does not match OP's current regex with regard to which extraneous characters are to be removed (eg, remove commas, remove trailing spaces, collapse multiple spaces to a single space, etc). [NOTE: it's not clear (to me) what this means: There is NO separator in output since in Python it was coming in capturing groups. In case of awk OR without capturing group's solution, can be separated with ,] For an initial solution we'll remove nothing; OP can always add code to strip out extraneous characters

Extending OP's current data set:

$ cat input.dat
flat line
test test123 - test (bla bla1 (On chutti)) _ level 1
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan1), Milkha Singh (On chutti) (Lagaan2))
a (b c) (d (e f), g (h i), j) k
1 (2 (3 (4 (5 5)), 3), 2 ), 1

Demonstration of levels:

1 (2 (3 (4 (5 5)), 3), 2 ), 1
^^                        ^^^ - level 1
   ^^                ^^^^     - level 2
      ^^         ^^^          - level 3
         ^^                   - level 4
            ^^^               - level 5

Demonstration of last level 2 data:

balle balle (testagain) (Rohit Sharma)
^^^^^^^^^^^^           ^               - level 1
             ^^^^^^^^^                 - level 2 (first occurrence)
                         ^^^^^^^^^^^^  - level 2 (last occurrence)

One awk idea using a recursive function to parse paren-delimited data:

awk -v lvl=2 '                                          # level to display
function parse(line, cur_lvl,    pos, char) {
    if (line == "") return
    pos  = match(line,/[()]/)                           # find 1st "(" or ")"
    char = (pos>0 ? substr(line,pos,1) : "")            # "(" or ")" ?

    if ((cur_lvl+1) == lvl && char == "(")              # if new level (>1) data set then ...
       out = ""                                         # clear previous data set

    if (cur_lvl == lvl) {                               # if at desired level then ...
       if (pos == 0) { out = out line; return }         # append; no more parens so go "up" in call stack
       else          out = out substr(line,1,pos-1)     # append
    }
    if (pos > 0)                                        # if we found a paren then recurse:
       parse(substr(line,pos+1), (char == "(" ? cur_lvl+1 : cur_lvl-1))
}

{ out     = ""                                          # init output
  cur_lvl = 1                                           # init starting level
  line    = $0                                          # make copy of $0

  parse(line, cur_lvl)                                  # start parsing

  ###  add code here to remove extraneous characters ?

  if (out != "")                                        # if we have something to print ...
     print ":" out ":"                                  # colons are added for display purposes; OP can remove once satisfied with results
}
' input.dat

Another awk solution using a linear approach to parsing (akin to Pierre's python solution)

awk -v lvl=2 '
{ out     = "" 
  cur_lvl = 1
  line    = $0

  while (pos = match(line,/[()]/)) {
        char = (pos>0 ? substr(line,pos,1) : "")
        if (cur_lvl == lvl) out = out substr(line,1,pos-1)
        if (char    == "(") { cur_lvl++; out = (cur_lvl==lvl ? "" : out) }
        if (char    == ")") cur_lvl-- 
        line = substr(line,pos+1) 
  }

  if (cur_lvl == lvl) out = out line
  if (    out != "" ) print ":" out ":" 
}
' input.dat

Taking for a test drive (both of the above awk solutions generate the same output given the same lvl setting):

For lvl=2 (OP's request):

:bla bla1 :
:Rinku Singh:
:Rohit Sharma:
:Virat kohli :
:Ranbir kapoor , Milkha Singh  :
:d , g , j:
:2 , 2 :

For lvl=1:

:flat line:
:test test123 - test  _ level 1:
:test test123 bla12 teeee :
:balle balle  :
:test test123 test1111 test45345  :
:testagain blae kaun hai ye banda :
:a   k:
:1 , 1:

For lvl=3:

:On chutti:
:Lagaan:
:Lagaan2:
:h i:
:3 , 3:

For lvl=5:

:5 5:

For lvl=6:

         # no output
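
The linear awk approach translates fairly directly to Python. The following is a sketch under the same assumptions listed at the top of this answer (parens are pure delimiters and are balanced); extract_level and its lvl parameter mirror the awk variable but are otherwise made-up names.

```python
def extract_level(line, lvl=2):
    """Return the text of the last data set found at nesting level lvl."""
    out = []
    cur_lvl = 1                      # text outside all parens is level 1
    for ch in line.rstrip("\n"):
        if ch == "(":
            cur_lvl += 1
            if cur_lvl == lvl:       # a new data set starts at the target level,
                out = []             # so discard the previous one
        elif ch == ")":
            cur_lvl -= 1
        elif cur_lvl == lvl:
            out.append(ch)           # keep only characters at the target level
    return "".join(out)
```

With the extended data set, extract_level("a (b c) (d (e f), g (h i), j) k") returns the same "d , g , j" as the awk runs for lvl=2; as in the awk versions, no extraneous characters are removed.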

Regex is generally not appropriate for parsing nested sets of parentheses.

Here is a short Python script that does what you asked:

import fileinput

for line in fileinput.input():
    line_result = ""
    parenthesis_level = 0 # Keeps track of how deep we are inside the parenthesis
    for char in line:
        if char == ")":
            parenthesis_level -= 1
        if parenthesis_level == 1 and char not in "()":
            line_result += char
        if char == "(":
            if parenthesis_level == 0: # Only keep the last outermost parenthesis
                line_result = "" # Discard any result from previous top-level parenthesis
            parenthesis_level += 1
    print(line_result)

I used fileinput for this PoC, but it should be trivial to replace it with whatever your data source is. I tried it with:

echo "test test123 - test (bla bla1 (On chutti))
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))" | python test.py

and got the following result:

bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor , Milkha Singh
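
The result above still carries a blank before the comma (and trailing blanks on some lines), whereas the expected output in the question has neither. A small post-processing step along these lines could normalise it; the helper name tidy is hypothetical, and the sketch assumes a comma never appears inside a name.

```python
def tidy(raw):
    """Collapse runs of whitespace and drop the blanks around commas."""
    # split on commas, then rebuild each part with single spaces between words
    parts = (" ".join(part.split()) for part in raw.split(","))
    # skip empty parts so stray commas do not produce empty entries
    return ", ".join(part for part in parts if part)
```

It would be applied to line_result just before printing, e.g. print(tidy(line_result)).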

Bonus:

As a way to make a point about not using regex for this kind of purpose, I did some light code-golfing and shortened the script above to the following:

import fileinput as i
for l in i.input():
 p=0
 for c in l:
  if c==")":p-=1
  if p==1 and c not in "()":r+=c
  if c=="(":
   if p==0:r=""
   p+= 1
 print(r)

Ignoring the import line (but still counting every character below including indents and line returns), this is 136 characters long, 14 characters shorter than the regular expression shown in the question. This shortened Python code is (in my opinion) still more readable/maintainable/extendable than any regex anyone can come up with.

If you are open to using Python with the PyPI regex module, you can use duplicate group names and then call captures("groupname") to return a list of all the captures of a group.

This regex assumes that there is max 1 level of nesting where there can be 1 or more occurrences of the same group name on that level.

\((?P<grp>[^()]+)(?:\([^()]*\)(?:,\s+(?P<grp>[^()]+)(?:\s*\([^()]*\))*)*)?\)$

The regex matches:

\( Match (
(?P<grp>[^()]+) Named grp to match from (...)
(?: Non capture group
- \([^()]*\) Match (...)
- (?: Non capture group
  - ,\s+ Match a comma and 1+ whitespace chars
  - (?P<grp>[^()]+) Named grp to match from (...)
  - (?:\s*\([^()]*\))* Optionally match 0+ whitespace chars followed by (...)
- )* Close the non capture group
)? Close the group and make it optional
\) Match )
$ End of string

See a regex demo and a Python demo

Example

import regex

strings = [
    "test test123 - test (bla bla1 (On chutti))",
    "test test123 - abc (test1 (test2)) test (bla bla1 (On chutti))",
    "test test123 bla12 teeee (Rinku Singh)",
    "balle balle (testagain) (Rohit Sharma)",
    "test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))",
    "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))",
    "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Ranbir kapoor (Lagaan), Milkha Singh (On chutti))",
    "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Ranbir kapoor (Lagaan), Test 1 (Test 2) (Test 3))"
]

pattern = r"\((?P<grp>[^()]+)(?:\([^()]*\)(?:,\s+(?P<grp>[^()]+)(?:\s*\([^()]*\))*)*)?\)$"

for s in strings:
    match = regex.search(pattern, s)
    if match:
        print(match.captures("grp"))

Output

['bla bla1 ']
['bla bla1 ']
['Rinku Singh']
['Rohit Sharma']
['Virat kohli ']
['Ranbir kapoor ', 'Milkha Singh ']
['Ranbir kapoor ', 'Ranbir kapoor ', 'Milkha Singh ']
['Ranbir kapoor ', 'Ranbir kapoor ', 'Test 1 ']
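
For contrast, the standard-library re module keeps only the last capture of a repeated group, which is why captures() from the regex module is useful here. The toy pattern below is an assumption for illustration, not the pattern from this answer.

```python
import re

# A capture group inside a repeated group: re remembers only the final repetition
m = re.match(r"(?:(\w+),?)+", "a,b,c")
last_only = m.group(1)

# With plain re, every item can still be collected by matching one item at a time
all_items = re.findall(r"(\w+),?", "a,b,c")

print(last_only)   # c
print(all_items)   # ['a', 'b', 'c']
```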

A few minor additions

This is not a foolproof solution, as matching parentheses can be very tricky
A part like (?:.*?) is the same as .*? You can omit the non capture group in a few sub parts of your regex, as the group by itself is not used for an alternation and has no quantifier
Note that a part like .*?\)\) matches as few characters as possible followed by )), but the .*? itself can also match parentheses, so it might unintentionally match too much
Looking at the regex that you tried in extended mode, you can see that you are using 4 alternatives, of which only the first and the last are anchored (the first with ^, the last with $). Using capture groups in separate branches like this might give you (in Python, for example) empty matches for all the groups that did not take part in the match
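
The last point can be seen with a toy pattern in Python's re module. The pattern and the helper first_set are made up for illustration and are much simpler than the regex in the question.

```python
import re

# One capture group per alternation branch
toy = re.compile(r"\((\w+)\)|\[(\w+)\]")

# The branch that did not match leaves its group as None ...
groups = toy.search("[abc]").groups()

# ... and findall() reports such groups as empty strings
pairs = toy.findall("(a) [b]")


def first_set(match):
    """Return the first group that took part in the match, or None."""
    return next((g for g in match.groups() if g is not None), None)


print(groups)                          # (None, 'abc')
print(pairs)                           # [('a', ''), ('', 'b')]
print(first_set(toy.search("[abc]")))  # abc
```

This is why a multi-branch regex usually needs extra code to pick out whichever group actually matched.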

You can use nestedExpr from the PyParsing module:

from pyparsing import nestedExpr
import re 

txt='''\
test test123 - test (bla bla1 (On chutti))
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))
'''

brace_parser = nestedExpr()

for line in txt.splitlines():
    new_line=re.sub(r'^[^(]*|[^)]*$', '', line)
    for m in brace_parser.parseString(new_line):
        stack=[]
        for e in m:
            if isinstance(e, str): stack.append(e)
        print(' '.join(stack))

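Prints (lines 3 and 4 show the first top-level group, testagain and Surya, rather than the last one, so they differ from the expected output in the question):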

bla bla1
Rinku Singh
testagain
Surya
Ranbir kapoor , Milkha Singh

Since my Input_file always follows the same patterns, with no edge cases, I will write it in this manner. Written and tested in GNU awk. It uses the match function with a regex and capturing groups, storing the captured values in an array named arr and then printing them as required.

awk '
match($0,/\(([^)]+)\) \(([^(]+)\([^)]+\)\)$/,arr){
  print arr[2]
  next
}
match($0,/^[^(]+\(([^(]+)\([^)]+\)\)$/,arr){
  print arr[1]
  next
}
match($0,/\(([^)]+)\)$/,arr){
  print arr[1]
  next
}
match($0,/\(([^(]+)\([^)]+\), ([^(]+)\(.*$/,arr){
  print arr[1] ", " arr[2]
}
'  Input_file

Output will be as follows.

bla bla1 
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor , Milkha Singh

Here is my simple awk solution that does the job with just the replacements:

cat srch.awk

{
   gsub(/^(\([^)(]*\)|[^()])*\(/, "");
   gsub(/ *\([^(]*\) */, "")
   sub(/[) ]*$/, "")
}
1

Then run it as:

awk -f srch.awk file

bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor, Milkha Singh
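
The same three substitutions can be written with Python's re.sub. This is a sketch that assumes the input looks like the samples; strip_to_names is a hypothetical name.

```python
import re


def strip_to_names(line):
    """Apply the three replacements from srch.awk to a single line."""
    s = line.rstrip("\n")
    # 1. drop everything up to and including the last outermost "("
    s = re.sub(r"^(?:\([^)(]*\)|[^()])*\(", "", s, count=1)
    # 2. drop each nested (...) together with its surrounding blanks
    s = re.sub(r" *\([^(]*\) *", "", s)
    # 3. drop any closing parens and blanks left at the end
    s = re.sub(r"[) ]*$", "", s, count=1)
    return s
```

As with the awk script, a line that contains no parentheses passes through unchanged.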

This solution is merely a streamlining of @anubhava's answer above:

echo '
test test123 - test (bla bla1 (On chutti))
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))' |

awk 'gsub(/^([(][^()]*[)]|[^()]+)*[(]| *[(][^(]*[)] *|[) ]*$/,_)'

bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor, Milkha Singh

Or if you prefer using FS / OFS:

awk ++NF FS='^([(][^()]*[)]|[^()]+)*[(]| *[(][^(]*[)] *|[ )]+$' OFS=
