We have data from a REST API call stored in an output file that looks as follows:
Sample Input File:
test test123 - test (bla bla1 (On chutti))
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))
Expected Output:
bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor, Milkha Singh
Conditions to Derive the Expected Output:
- Always consider the last occurrence of parentheses () in each line. We need to extract the values within this last, outermost pair of parentheses.
- Inside the last occurrence of (), extract all values that appear before each occurrence of nested parentheses ().
- Eg: in the line test test123 - test (bla bla1 (On chutti)) the last parenthesis starts from (bla and runs till chutti)), so I need bla bla1 since it is before the inner (On chutti). So look for the last parenthesis, and then for however many pairs of parentheses come inside it, we need to get the data before each of them. Eg: in the line testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan)) what is needed is Ranbir kapoor and Milkha Singh.
Attempted Regex: I tried using the following regular expression on Working Demo of regex:
Regex:
^(?:^[^(]+\([^)]+\) \(([^(]+)\([^)]+\)\))|[^(]+\(([^(]+)\([^)]+\),\s([^\(]+)\([^)]+\)\s\([^\)]+\)\)|(?:(?:.*?)\((.*?)\(.*?\)\))|(?:[^(]+\(([^)]+)\))$
The Regex that I have tried is working fine but I want to improve it with the advice of experts here.
Preferred Languages: Looking to improve this regex, OR a Python or an awk answer is also ok. I myself will also try to add an awk answer.
9 Answers
Any time you're considering using a lengthy and/or complicated regexp to try to solve a problem, keep in mind the quote:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Using any awk:
$ cat tst.awk
{
    rec = $0
    while ( match(rec, /\([^()]*)/) ) {
        tgt = substr($0,RSTART+1,RLENGTH-2)
        rec = substr(rec,1,RSTART-1) RS substr(rec,RSTART+1,RLENGTH-2) RS substr(rec,RSTART+RLENGTH)
    }
    gsub(/ *\([^()]*) */, "", tgt)
    print tgt
}
$ awk -f tst.awk file
bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor, Milkha Singh
I'm saving a copy of $0 in rec and then in the loop I'm converting every (foo) inside rec to \nfoo\n (assuming the default RS and that the RS cannot be present in a RS-separated record) and also saving the foo from $0 (to retain the possibly nested original ( and ) pairs) in the variable tgt. So when the loop ends tgt contains the last foo substring that was present in this input record, e.g. Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan). Then with the final gsub() I remove all (...) substrings from tgt, including any surrounding blanks, leaving just the desired output.
If you can ever have more levels of parenthesised strings remaining in tgt than just 1 level deep, just change gsub(/ *\([^()]*) */, "", tgt) to while ( gsub(/ *\([^()]*) */, "", tgt) );.
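For readers more comfortable with Python, the same idea can be sketched as follows. This is only an illustrative port, not part of the awk answer: the function name last_group is made up here, and the sketch assumes the input line contains no newline characters (it uses "\n" as the placeholder, just as the awk script uses RS).

```python
import re

# An innermost group: "(" ... ")" with no parentheses in between.
INNERMOST = re.compile(r"\([^()]*\)")


def last_group(line: str) -> str:
    """Return the text of the last outermost (...) group, minus nested groups.

    Hypothetical helper mirroring the awk loop: every innermost group found
    in the working copy has its parentheses replaced by newlines, so outer
    groups become "innermost" on a later pass. Offsets stay aligned with the
    original line because each parenthesis is swapped for one character.
    """
    rec = line
    tgt = ""
    while True:
        m = INNERMOST.search(rec)
        if m is None:
            break
        # Take the group's content from the ORIGINAL line, which still
        # holds any nested parentheses.
        tgt = line[m.start() + 1 : m.end() - 1]
        rec = rec[: m.start()] + "\n" + rec[m.start() + 1 : m.end() - 1] + "\n" + rec[m.end():]
    # Remove nested (...) groups together with surrounding blanks.
    return re.sub(r" *\([^()]*\) *", "", tgt)


if __name__ == "__main__":
    print(last_group("test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))"))
```

As in the awk version, a deeper nesting inside the final group would need the last substitution repeated until nothing changes.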
Purely based on your shown input and your comments reflecting that you need to capture 1 or 2 values per line, here is an optimized regex solution:
^(?:\([^)(]*\)|[^()])*\(([^)(]+)(?:\([^)(]*\)[, ]*(?:([^)(]+))?)?
RegEx Demo
RegEx Details:
This regex solution does the following:
- match everything before last (...) then match ( then
- 1st group: match name that must not have ( and ) then
- optional match of (...) or comma/space then
- 2nd group: match name that must not have ( and )
Further Details:
- ^ : Start
- (?: : Start non-capture group
- \([^\n)(]*\) : Match any pair of (...) text
- | : OR
- [^()\n] : Match any character that is not (, ) or \n
- )* : End non-capture group. Repeat this 0 or more times
- \( : Match last (
- ([^)(\n]+) : 1st capture group that matches text with 1+ characters that are not (, ) and \n
- (?: : Start non-capture group 1
- \([^\n)(]*\) : Match any pair of (...) text
- [, ]* : Match 0 or more of space or comma characters
- (?: : Start non-capture group 2
- ([^)(\n]+) : 2nd capture group that matches text with 1+ characters that are not (, ) and \n
- )? : End non-capture group 2. ? makes this an optional match
- )? : End non-capture group 1. ? makes this an optional match
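To try the regex from Python, a minimal sketch with the standard re module could look like this. The helper name extract_names is hypothetical, and stripping the trailing blank from each captured group is an added step that is not part of the regex itself. Like the regex, it captures at most two names per line.

```python
import re

# The regex from the answer, applied one line at a time.
PATTERN = re.compile(
    r"^(?:\([^)(]*\)|[^()])*\(([^)(]+)(?:\([^)(]*\)[, ]*(?:([^)(]+))?)?"
)


def extract_names(line: str) -> list[str]:
    """Return the 1 or 2 captured names, with surrounding blanks removed."""
    m = PATTERN.match(line)
    if m is None:
        return []  # no parentheses on this line
    # Group 2 is None when the line carries only one name.
    return [g.strip() for g in m.groups() if g is not None]


if __name__ == "__main__":
    line = "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))"
    print(", ".join(extract_names(line)))
```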
Assumptions/understandings:
- parens ((, )) exist solely as delimiters (ie, they do not show up as part of the data)
- every ( has a matching )
- level=1 consists of all text not in parens
- each successive ( takes us down one level
- each successive ) takes us up one level
- for level N (N>1) we are to display only the last matching set of data (eg, if there are 2 distinct sets of level=2 data then we only display the last data set)
- the textual description does not match OP's current regex with regards to what extraneous characters are to be removed (eg, remove commas, remove trailing spaces, collapse multiple spaces to single space, etc); [NOTE: it's not clear (to me) what this means: "There is NO separator in output since in Python it was coming in capturing groups. In case of awk OR without capturing group's solution, can be separated with ,"]; for an initial solution we'll remove nothing; OP can always add code to strip out extraneous characters
Extending OP's current data set:
$ cat input.dat
flat line
test test123 - test (bla bla1 (On chutti)) _ level 1
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan1), Milkha Singh (On chutti) (Lagaan2))
a (b c) (d (e f), g (h i), j) k
1 (2 (3 (4 (5 5)), 3), 2 ), 1
Demonstration of levels:
1 (2 (3 (4 (5 5)), 3), 2 ), 1
^^                        ^^^ - level 1
   ^^                ^^^^     - level 2
      ^^         ^^^          - level 3
         ^^                   - level 4
            ^^^               - level 5
Demonstration of last level 2 data:
balle balle (testagain) (Rohit Sharma)
^^^^^^^^^^^^           ^               - level 1
             ^^^^^^^^^                 - level 2 (first occurrence)
                         ^^^^^^^^^^^^  - level 2 (last occurrence)
One awk idea using a recursive function to parse paren-delimited data:
awk -v lvl=2 '                                        # level to display
function parse(line, cur_lvl,    pos, char) {
    if (line == "") return
    pos  = match(line,/[()]/)                         # find 1st "(" or ")"
    char = (pos>0 ? substr(line,pos,1) : "")          # "(" or ")" ?
    if ((cur_lvl+1) == lvl && char == "(")            # if new level (>1) data set then ...
        out = ""                                      # clear previous data set
    if (cur_lvl == lvl) {                             # if at desired level then ...
        if (pos == 0) { out = out line; return }      # append; no more parens so go "up" in call stack
        else            out = out substr(line,1,pos-1)  # append
    }
    if (pos > 0)                                      # if we found a paren then recurse:
        parse(substr(line,pos+1), (char == "(" ? cur_lvl+1 : cur_lvl-1))
}
{ out     = ""                                        # init output
  cur_lvl = 1                                         # init starting level
  line    = $0                                        # make copy of $0
  parse(line, cur_lvl)                                # start parsing
  ### add code here to remove extraneous characters ?
  if (out != "")                                      # if we have something to print ...
      print ":" out ":"                               # colons are added for display purposes; OP can remove once satisfied with results
}
' input.dat
Another awk solution using a linear approach to parsing (akin to Pierre's python solution):
awk -v lvl=2 '
{ out     = ""
  cur_lvl = 1
  line    = $0
  while (pos = match(line,/[()]/)) {
      char = (pos>0 ? substr(line,pos,1) : "")
      if (cur_lvl == lvl) out = out substr(line,1,pos-1)
      if (char == "(") { cur_lvl++; out = (cur_lvl==lvl ? "" : out) }
      if (char == ")") cur_lvl--
      line = substr(line,pos+1)
  }
  if (cur_lvl == lvl) out = out line
  if ( out != "" ) print ":" out ":"
}
' input.dat
Taking for a test drive (both of the above awk solutions generate the same output given the same lvl setting):
For lvl=2 (OP's request):
:bla bla1 :
:Rinku Singh:
:Rohit Sharma:
:Virat kohli :
:Ranbir kapoor , Milkha Singh :
:d , g , j:
:2 , 2 :
For lvl=1:
:flat line:
:test test123 - test _ level 1:
:test test123 bla12 teeee :
:balle balle :
:test test123 test1111 test45345 :
:testagain blae kaun hai ye banda :
:a k:
:1 , 1:
For lvl=3:
:On chutti:
:Lagaan:
:Lagaan2:
:h i:
:3 , 3:
For lvl=5:
:5 5:
For lvl=6:
# no output
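The level-based approach above can also be sketched in Python. This is an illustrative port of the linear awk version, not code from the answer; the function name level_text is an assumption, and like the awk scripts it strips nothing from the result.

```python
def level_text(line: str, lvl: int = 2) -> str:
    """Return the text found at nesting level `lvl` (1 = outside all parens).

    For lvl > 1 only the last data set is kept: the buffer is cleared each
    time a "(" opens a new group at the requested level.
    """
    out = ""
    cur_lvl = 1
    for char in line:
        if char == "(":
            cur_lvl += 1
            if cur_lvl == lvl:
                out = ""  # a new data set starts at the requested level
        elif char == ")":
            cur_lvl -= 1
        elif cur_lvl == lvl:
            out += char
    return out


if __name__ == "__main__":
    sample = "1 (2 (3 (4 (5 5)), 3), 2 ), 1"
    for lvl in range(1, 7):
        # Colons are added for display purposes only, as in the awk output.
        print(lvl, ":" + level_text(sample, lvl) + ":")
```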
Regex is generally not appropriate for parsing nested sets of parentheses.
Here is a short Python script that does what you asked:
import fileinput

for line in fileinput.input():
    line_result = ""
    parenthesis_level = 0  # Keeps track of how deep we are inside the parenthesis
    for char in line:
        if char == ")":
            parenthesis_level -= 1
        if parenthesis_level == 1 and char not in "()":
            line_result += char
        if char == "(":
            if parenthesis_level == 0:  # Only keep the last outermost parenthesis
                line_result = ""  # Discard any result from previous top-level parenthesis
            parenthesis_level += 1
    print(line_result)
I used fileinput for this PoC, but it should be trivial to replace it with whatever your data source is. I tried it with:
echo "test test123 - test (bla bla1 (On chutti))
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti)" | python test.py
and got the following result:
bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor , Milkha Singh
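Note that this result keeps the blank that preceded each removed (...) group, so it reads "Ranbir kapoor , Milkha Singh" rather than the "Ranbir kapoor, Milkha Singh" shown in the question. If that matters, a small clean-up step can be added. The sketch below wraps the same loop in a function; the name last_group_names and the clean-up regex are assumptions for illustration, not part of the script above.

```python
import re


def last_group_names(line: str) -> str:
    """Same character loop as the script above, plus a whitespace clean-up."""
    result = ""
    depth = 0  # how deep we are inside the parentheses
    for char in line:
        if char == ")":
            depth -= 1
        if depth == 1 and char not in "()":
            result += char
        if char == "(":
            if depth == 0:
                result = ""  # discard any earlier top-level group
            depth += 1
    # Added step: drop blanks in front of commas and at both ends.
    return re.sub(r"\s+,", ",", result).strip()


if __name__ == "__main__":
    print(last_group_names(
        "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))"
    ))
```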
Bonus:
As a way to make a point about not using regex for this kind of purpose, I did some light code-golfing and shortened the script above to the following:
import fileinput as i
for l in i.input():
 p=0
 for c in l:
  if c==")":p-=1
  if p==1 and c not in "()":r+=c
  if c=="(":
   if p==0:r=""
   p+= 1
 print(r)
Ignoring the import line (but still counting every character below including indents and line returns), this is 136 characters long, 14 characters shorter than the regular expression shown in the question. This shortened Python code is (in my opinion) still more readable/maintainable/extendable than any regex anyone can come up with.
If you are open to using Python with the PyPI regex module you can use duplicate group names and then use captures("groupname") to return a list of all the captures of a group.
This regex assumes that there is max 1 level of nesting where there can be 1 or more occurrences of the same group name on that level.
\((?P<grp>[^()]+)(?:\([^()]*\)(?:,\s+(?P<grp>[^()]+)(?:\s*\([^()]*\))*)*)?\)$
The regex matches:
- \( Match (
- (?P<grp>[^()]+) Named group grp to match from (...)
- (?: Non capture group
- \([^()]*\) Match (...)
- (?: Non capture group
- ,\s+ Match a comma and 1+ whitespace chars
- (?P<grp>[^()]+) Named group grp to match from (...)
- (?:\s*\([^()]*\))* Optionally match 0+ whitespace chars followed by (...)
- )* Close the non capture group
- )? Close the group and make it optional
- \) Match )
- $ End of string
See a regex demo and a Python demo
Example
import regex

strings = [
    "test test123 - test (bla bla1 (On chutti))",
    "test test123 - abc (test1 (test2)) test (bla bla1 (On chutti))",
    "test test123 bla12 teeee (Rinku Singh)",
    "balle balle (testagain) (Rohit Sharma)",
    "test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))",
    "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))",
    "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Ranbir kapoor (Lagaan), Milkha Singh (On chutti))",
    "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Ranbir kapoor (Lagaan), Test 1 (Test 2) (Test 3))"
]

pattern = r"\((?P<grp>[^()]+)(?:\([^()]*\)(?:,\s(?P<grp>[^()]+)(?:\s*\([^()]*\))*)*)?\)$"

for s in strings:
    match = regex.search(pattern, s)
    if match:
        print(match.captures("grp"))
Output
['bla bla1 ']
['bla bla1 ']
['Rinku Singh']
['Rohit Sharma']
['Virat kohli ']
['Ranbir kapoor ', 'Milkha Singh ']
['Ranbir kapoor ', 'Ranbir kapoor ', 'Milkha Singh ']
['Ranbir kapoor ', 'Ranbir kapoor ', 'Test 1 ']
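If installing the regex module is not an option, a comparable result can be sketched with the standard re module in two steps: first isolate the last (...) group, then drop its nested groups and split on commas. This is an alternative technique rather than the one shown above; the helper name names_in_last_group is hypothetical, and the sketch shares the assumption of at most 1 level of nesting inside the last group. Unlike the captures above, the names come back with surrounding blanks stripped.

```python
import re

# Last group on the line: "(" then plain text or flat (...) groups, then ")".
LAST_GROUP = re.compile(r"\(((?:[^()]|\([^()]*\))*)\)\s*$")
# A nested group without further parentheses inside it.
INNER = re.compile(r"\([^()]*\)")


def names_in_last_group(line: str) -> list[str]:
    """Return every comma-separated name of the last group, stripped."""
    m = LAST_GROUP.search(line)
    if m is None:
        return []
    flat = INNER.sub("", m.group(1))  # remove the nested (...) parts
    return [part.strip() for part in flat.split(",") if part.strip()]


if __name__ == "__main__":
    print(names_in_last_group(
        "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Ranbir kapoor (Lagaan), Milkha Singh (On chutti))"
    ))
```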
A few minor additions
- This is not a foolproof solution, as matching parentheses can be very tricky
- A part like (?:.*?) is the same as .*? You can omit the non capture group in a few sub parts of your regex, as the group by itself is not being used for an alternation and there are no quantifiers for that group
- Note that a part like this .*?\)\) matches as few characters as possible followed by )) where the .* itself can also match parentheses, so it might unintentionally match too much
- Looking at the regex that you tried in extended mode you can see that you are using 4 alternations where only the first and the last alternative are anchored. Using capture groups like this in separate branches might give you (in Python for example) empty matches for all the groups that have no match
You can use nestedExpr from the PyParsing module:
from pyparsing import nestedExpr
import re

txt='''\
test test123 - test (bla bla1 (On chutti))
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))
'''

brace_parser = nestedExpr()
for line in txt.splitlines():
    new_line=re.sub(r'^[^(]*|[^)]*$', '', line)
    for m in brace_parser.parseString(new_line):
        stack=[]
        for e in m:
            if isinstance(e, str): stack.append(e)
        print(' '.join(stack))
Prints:
bla bla1
Rinku Singh
testagain
Surya
Ranbir kapoor , Milkha Singh
Since my Input_file always follows the same pattern, with no edge cases, I will write it in this manner. Written and tested in GNU awk, using match functions with regexes inside them and capturing groups to store the values into an array named arr, which are later printed as per the requirement.
awk '
match($0,/\(([^)]+)\) \(([^(]+)\([^)]+\))$/,arr){
  print arr[2]
  next
}
match($0,/^[^(]+\(([^(]+)\([^)]+\)\)$/,arr){
  print arr[1]
  next
}
match($0,/\(([^)]+)\)$/,arr){
  print arr[1]
  next
}
match($0,/\(([^(]+)\([^)]+\), ([^(]+)\(.*$/,arr){
  print arr[1] ", " arr[2]
}
' Input_file
Output will be as follows.
bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor , Milkha Singh
Here is my simple awk solution that does the job with just the replacements:
cat srch.awk
{
    gsub(/^(\([^)(]*\)|[^()])*\(/, "");
    gsub(/ *\([^(]*\) */, "")
    sub(/[) ]*$/, "")
}
1
Then run it as:
awk -f srch.awk file
bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor, Milkha Singh
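The three substitutions translate almost one-to-one to Python's re.sub. The following is only a sketch of that translation, with the hypothetical function name strip_to_names; it relies on the same assumptions about the input as the awk script.

```python
import re


def strip_to_names(line: str) -> str:
    """Apply the same three replacements as srch.awk, in the same order."""
    # 1. Remove everything up to and including the "(" that opens the last group.
    line = re.sub(r"^(?:\([^)(]*\)|[^()])*\(", "", line)
    # 2. Remove nested (...) parts together with surrounding blanks.
    line = re.sub(r" *\([^(]*\) *", "", line)
    # 3. Remove trailing ")" characters and blanks.
    return re.sub(r"[) ]*$", "", line)


if __name__ == "__main__":
    for sample in (
        "test test123 bla12 teeee (Rinku Singh)",
        "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))",
    ):
        print(strip_to_names(sample))
```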
This solution is merely a streamlining of @anubhava's answer above:
echo '
test test123 - test (bla bla1 (On chutti))
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))' |
awk 'gsub(/^([(][^()]*[)]|[^()]+)*[(]| *[(][^(]*[)] *|[) ]*$/,_)'
bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor, Milkha Singh
Or if you prefer using FS / OFS:
awk ++NF FS='^([(][^()]*[)]|[^()]+)*[(]| *[(][^(]*[)] *|[ )]+$' OFS=