In this tutorial we examine ways of specifying WHERE in the source string we should look for a match.
Other aspects of pattern matching will be kept constant throughout the tutorial, namely:
- "WHAT" will be a single character
- "HOW MANY" will be one repetition of the text
- "WHICH" occurrence will be the first occurrence.
Regular Expressions versus LIKE
Before regular expressions, pattern matching was typically done with the LIKE operator. When using this operator a pattern like 'abc' would only match 'abc', not '---abc' or '---abc---'. To match a string that contains 'abc' anywhere inside it, the pattern '%abc%' would be used.
execute set_target( 'abc' ) execute add_target( '---abc' ) execute add_target( '---abc---' ) select 'abc' as pattern, target from targets where target like 'abc';
PATTERN TARGET --------------- ---------- abc abc
select '%abc%' as pattern, target from targets where target like '%abc%';
PATTERN TARGET --------------- ---------- %abc% abc ---abc ---abc---
Regular expressions take the opposite approach. The pattern 'abc' matches any target containing 'abc' and a special pattern like '^abc$' must be used to match only the target 'abc'.
select 'abc' as pattern, target from targets where regexp_like( target, 'abc' );
PATTERN TARGET --------------- ---------- abc abc ---abc ---abc---
select '^abc$' as pattern, target from targets where regexp_like( target, '^abc$' );
PATTERN TARGET --------------- ---------- ^abc$ abc
Matching the Beginning or End of a Line
As we saw in the tutorial HOW MANY, a target string like 'abc' is treated as a sequence of empty strings and characters.
(empty string) a (empty string) b (empty string) c (empty string)
A multi-line string with a Linefeed (LF) in it, like 'abc(LF)def' is treated as follows.
(empty string) a (empty string) b (empty string) c (empty string) (LF) (empty string) d (empty string) e (empty string) f (empty string)
This treatment enables us to specify two special locations within the target, the beginning of lines and the end of lines.
By default the character '^' matches the first empty string in a target and the character '$' matches the last empty string in a target. In multi-line mode (initiated using the 'm' parameter in REGEXP conditions and functions) '^' matches the first empty string in the target *and* the first empty string after any Linefeed characters. The character '$' performs similarly for the last empty string in a target or line. The following examples illustrate this. The symbol (^) indicates an empty string matched by '^' and the symbol ($) indicates an empty string matched by '$'.
Default Mode:
(^) a (empty string) b (empty string) c (empty string) (LF) (empty string) d (empty string) e (empty string) f ($)
Multi-line Mode:
(^) a (empty string) b (empty string) c ($) (LF) (^) d (empty string) e (empty string) f ($)
The following examples show how '^' and '$' operate in default mode. In the target data, 'unistr('\000A')' represents a Linefeed character and the target value 'abc(LF)def' appears as 'abc' on one line and 'def' on the next.
execute set_pattern( '^' ) execute add_pattern( '$' ) execute add_pattern( '^.' ) execute add_pattern( '.$' ) execute add_pattern( 'a' ) execute add_pattern( '^a' ) execute add_pattern( '^d' ) execute add_pattern( 'c$' ) execute add_pattern( 'f' ) execute add_pattern( 'f$' ) execute set_target( 'abcdef' ) execute add_target( 'abc' || unistr('\000A') || 'def' ) set pagesize 60 set recsep each break on pattern select pattern , target , regexp_substr( target, pattern ) as matched_value , regexp_instr( target, pattern ) as position from patterns , targets order by sort_key , target desc ;
PATTERN TARGET MATCHED_VALUE POSITION --------------- ---------- --------------- -------- ^ abcdef (null) 1 abc (null) 1 def $ abcdef (null) 7 abc (null) 8 def ^. abcdef a 1 abc a 1 def .$ abcdef f 6 abc f 7 def a abcdef a 1 abc a 1 def ^a abcdef a 1 abc a 1 def ^d abcdef (null) 0 abc (null) 0 def c$ abcdef (null) 0 abc (null) 0 def f abcdef f 6 abc f 7 def f$ abcdef f 6 abc f 7 def
Using the same patterns in multi-line mode gives us these results.
select pattern , target , regexp_substr( target, pattern, 1, 1, 'm' ) as matched_value , regexp_instr( target, pattern, 1, 1, 0, 'm' ) as position from patterns , targets order by sort_key , target desc ;
PATTERN TARGET MATCHED_VALUE POSITION --------------- ---------- --------------- -------- ^ abcdef (null) 1 abc (null) 1 def $ abcdef (null) 7 abc (null) 4 def ^. abcdef a 1 abc a 1 def .$ abcdef f 6 abc c 3 def a abcdef a 1 abc a 1 def ^a abcdef a 1 abc a 1 def ^d abcdef (null) 0 abc d 5 def c$ abcdef (null) 0 abc c 3 def f abcdef f 6 abc f 7 def f$ abcdef f 6 abc f 7 def
set recsep wrapped break on pattern skip 1
For the six character target 'abcdef' note that REGEXP_INSTR returns '1' for both '^' and 'a'. Contrast this with the results for 'f' and '$' which are '6' and '7' respectively. One might expect REGEXP_INSTR to return "0" for '^', to be consistent with the behaviour of "$", but it cannot because '0' is used to indicate that no match was found. Be aware of this difference when using REGEXP_INSTR with '^' and '$'.
Starting the Search From Somewhere Other Than Position One
By default Oracle searches the target string starting from the first character in the target string. To start from a different position, REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR accept a "position" parameter to set the starting point.
Starting from position 1 (no position parameter is used):
execute set_target( 'abc1--abc2' ) select target , 'abc.' as pattern , regexp_substr( target, 'abc.' ) as matched_value , regexp_instr( target, 'abc.' ) as position from targets ;
TARGET PATTERN MATCHED_VALUE POSITION ---------- --------------- --------------- -------- abc1--abc2 abc. abc1 1
Starting from position 5 (position parameter is set to "5"):
select target , 'abc.' as pattern , regexp_substr( target, 'abc.', 5 ) as matched_value , regexp_instr( target, 'abc.', 5 ) as position from targets ;
TARGET PATTERN MATCHED_VALUE POSITION ---------- --------------- --------------- -------- abc1--abc2 abc. abc2 7
Note that, regardless of which location you start the search at, REGEXP_INSTR always returns a position relative to the start of the entire target.
Unlike the three REGEXP functions, the REGEXP_LIKE condition does not accept a position parameter. The following examples show how to work around this. They all search for targets that have the string 'abc' anywhere at or after the fifth character in a string.
execute set_target( 'abc' ) execute add_target( 'abc---def' ) execute add_target( 'def---abc' ) select target from targets where regexp_like( target, '^.{4}.*abc' ) ;
TARGET ---------- def---abc
select target from targets where regexp_like( substr( target, 5 ), 'abc' ) ;
TARGET ---------- def---abc