In this tutorial we examine ways of specifying WHERE in the source string we should look for a match.
Other aspects of pattern matching will be kept constant throughout the tutorial, namely:
- "WHAT" will be a single character
- "HOW MANY" will be one repetition of the text
- "WHICH" occurrence will be the first occurrence.
Regular Expressions versus LIKE
Before regular expressions, pattern matching was typically done with the LIKE operator. When using this operator a pattern like 'abc' would only match 'abc', not '---abc' or '---abc---'. To match a string that contains 'abc' anywhere inside it, the pattern '%abc%' would be used.
execute set_target( 'abc' ) execute add_target( '---abc' ) execute add_target( '---abc---' ) select 'abc' as pattern, target from targets where target like 'abc';
PATTERN TARGET --------------- ---------- abc abc
select '%abc%' as pattern, target from targets where target like '%abc%';
PATTERN TARGET
--------------- ----------
%abc% abc
---abc
---abc---
Regular expressions take the opposite approach. The pattern 'abc' matches any target containing 'abc' and a special pattern like '^abc$' must be used to match only the target 'abc'.
select 'abc' as pattern, target from targets where regexp_like( target, 'abc' );
PATTERN TARGET
--------------- ----------
abc abc
---abc
---abc---
select '^abc$' as pattern, target from targets where regexp_like( target, '^abc$' );
PATTERN TARGET --------------- ---------- ^abc$ abc
Matching the Beginning or End of a Line
As we saw in the tutorial HOW MANY, a target string like 'abc' is treated as a sequence of empty strings and characters.
(empty string) a (empty string) b (empty string) c (empty string)
A multi-line string with a Linefeed (LF) in it, like 'abc(LF)def' is treated as follows.
(empty string) a (empty string) b (empty string) c (empty string) (LF) (empty string) d (empty string) e (empty string) f (empty string)
This treatment enables us to specify two special locations within the target, the beginning of lines and the end of lines.
By default the character '^' matches the first empty string in a target and the character '$' matches the last empty string in a target. In multi-line mode (initiated using the 'm' parameter in REGEXP conditions and functions) '^' matches the first empty string in the target *and* the first empty string after any Linefeed characters. The character '$' performs similarly for the last empty string in a target or line. The following examples illustrate this. The symbol (^) indicates an empty string matched by '^' and the symbol ($) indicates an empty string matched by '$'.
Default Mode:
(^) a (empty string) b (empty string) c (empty string) (LF) (empty string) d (empty string) e (empty string) f ($)
Multi-line Mode:
(^) a (empty string) b (empty string) c ($) (LF) (^) d (empty string) e (empty string) f ($)
The following examples show how '^' and '$' operate in default mode. In the target data, 'unistr('\000A')' represents a Linefeed character and the target value 'abc(LF)def' appears as 'abc' on one line and 'def' on the next.
execute set_pattern( '^' )
execute add_pattern( '$' )
execute add_pattern( '^.' )
execute add_pattern( '.$' )
execute add_pattern( 'a' )
execute add_pattern( '^a' )
execute add_pattern( '^d' )
execute add_pattern( 'c$' )
execute add_pattern( 'f' )
execute add_pattern( 'f$' )
execute set_target( 'abcdef' )
execute add_target( 'abc' || unistr('\000A') || 'def' )
set pagesize 60
set recsep each
break on pattern
select
pattern ,
target ,
regexp_substr( target, pattern ) as matched_value ,
regexp_instr( target, pattern ) as position
from
patterns ,
targets
order by
sort_key ,
target desc
;
PATTERN TARGET MATCHED_VALUE POSITION
--------------- ---------- --------------- --------
^ abcdef (null) 1
abc (null) 1
def
$ abcdef (null) 7
abc (null) 8
def
^. abcdef a 1
abc a 1
def
.$ abcdef f 6
abc f 7
def
a abcdef a 1
abc a 1
def
^a abcdef a 1
abc a 1
def
^d abcdef (null) 0
abc (null) 0
def
c$ abcdef (null) 0
abc (null) 0
def
f abcdef f 6
abc f 7
def
f$ abcdef f 6
abc f 7
def
Using the same patterns in multi-line mode gives us these results.
select pattern , target , regexp_substr( target, pattern, 1, 1, 'm' ) as matched_value , regexp_instr( target, pattern, 1, 1, 0, 'm' ) as position from patterns , targets order by sort_key , target desc ;
PATTERN TARGET MATCHED_VALUE POSITION
--------------- ---------- --------------- --------
^ abcdef (null) 1
abc (null) 1
def
$ abcdef (null) 7
abc (null) 4
def
^. abcdef a 1
abc a 1
def
.$ abcdef f 6
abc c 3
def
a abcdef a 1
abc a 1
def
^a abcdef a 1
abc a 1
def
^d abcdef (null) 0
abc d 5
def
c$ abcdef (null) 0
abc c 3
def
f abcdef f 6
abc f 7
def
f$ abcdef f 6
abc f 7
def
set recsep wrapped break on pattern skip 1
For the six character target 'abcdef' note that REGEXP_INSTR returns '1' for both '^' and 'a'. Contrast this with the results for 'f' and '$' which are '6' and '7' respectively. One might expect REGEXP_INSTR to return "0" for '^', to be consistent with the behaviour of "$", but it cannot because '0' is used to indicate that no match was found. Be aware of this difference when using REGEXP_INSTR with '^' and '$'.
Starting the Search From Somewhere Other Than Position One
By default Oracle searches the target string starting from the first character in the target string. To start from a different position, REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR accept a "position" parameter to set the starting point.
Starting from position 1 (no position parameter is used):
execute set_target( 'abc1--abc2' ) select target , 'abc.' as pattern , regexp_substr( target, 'abc.' ) as matched_value , regexp_instr( target, 'abc.' ) as position from targets ;
TARGET PATTERN MATCHED_VALUE POSITION ---------- --------------- --------------- -------- abc1--abc2 abc. abc1 1
Starting from position 5 (position parameter is set to "5"):
select target , 'abc.' as pattern , regexp_substr( target, 'abc.', 5 ) as matched_value , regexp_instr( target, 'abc.', 5 ) as position from targets ;
TARGET PATTERN MATCHED_VALUE POSITION ---------- --------------- --------------- -------- abc1--abc2 abc. abc2 7
Note that, regardless of which location you start the search at, REGEXP_INSTR always returns a position relative to the start of the entire target.
Unlike the three REGEXP functions, the REGEXP_LIKE condition does not accept a position parameter. The following examples show how to work around this. They all search for targets that have the string 'abc' anywhere at or after the fifth character in a string.
execute set_target( 'abc' )
execute add_target( 'abc---def' )
execute add_target( 'def---abc' )
select target
from targets
where regexp_like( target, '^.{4}.*abc' ) ;
TARGET ---------- def---abc
select target from targets where regexp_like( substr( target, 5 ), 'abc' ) ;
TARGET ---------- def---abc
