Regular expressions in Java, Part 1: Pattern matching and the Pattern class

Java's character and assorted string classes offer low-level support for pattern matching, but that support typically leads to complex code. For simpler and more efficient coding, Java offers the Regex API. This two-part tutorial helps you get started with regular expressions and the Regex API. First we'll unpack the three classes residing in the java.util.regex package, then we'll explore the Pattern class and its sophisticated pattern-matching constructs.

Get the code: download the source code for the example applications in this tutorial. Created by Jeff Friesen for JavaWorld.

What are regular expressions?

A regular expression, also known as a regex or regexp, is a string whose pattern (template) describes a set of strings. The pattern determines which strings belong to the set. A pattern consists of literal characters and metacharacters, which are characters that have special meaning instead of a literal meaning.

Pattern matching is the process of searching text to identify matches, or strings that match a regex's pattern. Java supports pattern matching via its Regex API. The API consists of three classes (Pattern, Matcher, and PatternSyntaxException), all located in the java.util.regex package:

  • Pattern objects, also known as patterns, are compiled regexes.
  • Matcher objects, or matchers, are engines that interpret patterns to locate matches in character sequences (objects whose classes implement the java.lang.CharSequence interface and serve as text sources).
  • PatternSyntaxException objects describe illegal regex patterns.

Java also provides support for pattern matching via various methods in its java.lang.String class. For example, boolean matches(String regex) returns true only if the invoking string matches the regex argument in its entirety.

Convenience methods

Behind the scenes, matches() and String's other regex-oriented convenience methods are implemented in terms of the Regex API.
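As a minimal sketch of that relationship, the following compares String's matches() with the equivalent call chain on Pattern and Matcher. The class name MatchesDemo and its two helper methods are hypothetical names chosen for illustration; only matches(), Pattern.compile(), and matcher() come from the Java API.

```java
import java.util.regex.Pattern;

public class MatchesDemo {
    // Convenience route: true only if the entire input matches the regex.
    static boolean viaString(String regex, String input) {
        return input.matches(regex);
    }

    // The same whole-input check written directly against the Regex API.
    static boolean viaPattern(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(viaString("apple", "apple"));   // true
        System.out.println(viaString("apple", "applet"));  // false: the trailing t is unmatched
        System.out.println(viaPattern("apple", "applet")); // false
    }
}
```

Both routes test the whole string, which is different from the search for partial matches that RegexDemo performs with find().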

RegexDemo

I've created the RegexDemo application to demonstrate Java's regular expressions and the various methods located in the Pattern, Matcher, and PatternSyntaxException classes. Here's the source code for the demo:

Listing 1. Demonstrating regexes

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexDemo
{
   public static void main(String[] args)
   {
      if (args.length != 2)
      {
         System.err.println("usage: java RegexDemo regex input");
         return;
      }
      // Convert new-line (\n) character sequences to new-line characters.
      args[1] = args[1].replaceAll("\\\\n", "\n");
      try
      {
         System.out.println("regex = " + args[0]);
         System.out.println("input = " + args[1]);
         Pattern p = Pattern.compile(args[0]);
         Matcher m = p.matcher(args[1]);
         while (m.find())
            System.out.println("Found [" + m.group() + "] starting at "
                               + m.start() + " and ending at " + (m.end() - 1));
      }
      catch (PatternSyntaxException pse)
      {
         System.err.println("Bad regex: " + pse.getMessage());
         System.err.println("Description: " + pse.getDescription());
         System.err.println("Index: " + pse.getIndex());
         System.err.println("Incorrect pattern: " + pse.getPattern());
      }
   }
}

The first thing RegexDemo's main() method does is validate its command line. It requires two arguments: the first argument is a regex, and the second argument is the input text to be matched against the regex.

You might want to specify a new-line (\n) character as part of the input text. The only way to accomplish this is to specify a \ character followed by an n character. main() converts this character sequence to Unicode value 10.
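To make the conversion concrete, here is a small sketch that isolates the replaceAll() call from Listing 1. The class name NewlineArgDemo and the convert() helper are hypothetical; the regex and replacement strings are the ones Listing 1 uses.

```java
public class NewlineArgDemo {
    // The regex \\n (written "\\\\n" in a Java string literal) matches a
    // backslash followed by n; the replacement is one real new-line character.
    static String convert(String arg) {
        return arg.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        // Simulates the command-line argument line1\nline2 (12 characters).
        String typed = "line1\\nline2";
        String converted = convert(typed);
        System.out.println(typed.length());            // 12
        System.out.println(converted.length());        // 11
        System.out.println((int) converted.charAt(5)); // 10
    }
}
```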

The bulk of RegexDemo's code is located in the try-catch construct. The try block first outputs the specified regex and input text and then creates a Pattern object that stores the compiled regex. (Regexes are compiled to improve performance during pattern matching.) A matcher is extracted from the Pattern object and used to repeatedly search for matches until none remain. The catch block invokes various PatternSyntaxException methods to extract useful information about the exception. This information is subsequently output.
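The catch path is easy to see in isolation. The following sketch compiles one legal and one illegal regex and reports what the exception exposes. BadRegexDemo and check() are hypothetical names, and the exact description text is left to the runtime rather than assumed here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class BadRegexDemo {
    // Returns "ok" if the regex compiles; otherwise summarizes the exception.
    static String check(String regex) {
        try {
            Pattern.compile(regex);
            return "ok";
        } catch (PatternSyntaxException pse) {
            return "bad regex [" + pse.getPattern() + "] at index "
                   + pse.getIndex() + ": " + pse.getDescription();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("apple"));
        System.out.println(check("[abc")); // unclosed character class
    }
}
```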

You don't need to know more about how the source code works at this point; it will become clear when you explore the API in Part 2. You do need to compile Listing 1, however. Grab the code from Listing 1, then type the following into your command line to compile RegexDemo:

javac RegexDemo.java

Pattern and its constructs

Pattern, the first of the three classes that make up the Regex API, is a compiled representation of a regular expression. Pattern's SDK documentation describes various regex constructs, but unless you're already an avid regex user, you might be confused by parts of the documentation. What are quantifiers, and what's the difference between greedy, reluctant, and possessive quantifiers? What are character classes, boundary matchers, back references, and embedded flag expressions? I'll answer these questions and more in the next sections.

Literal strings

The simplest regex construct is the literal string. Some portion of the input text must match this construct's pattern for a pattern match to succeed. Consider the following example:

java RegexDemo apple applet

This example attempts to discover whether there is a match for the apple pattern in the applet input text. The following output reveals the match:

regex = apple
input = applet
Found [apple] starting at 0 and ending at 4

The output shows us the regex and input text, then indicates a successful match of apple within applet. Additionally, it presents the starting and ending indexes of that match: 0 and 4, respectively. The starting index identifies the first text location where a pattern match occurs; the ending index identifies the last text location for the match.

Now suppose we specify the following command line:

java RegexDemo apple crabapple

This time, we get the following match with different starting and ending indexes:

regex = apple
input = crabapple
Found [apple] starting at 4 and ending at 8

The reverse scenario, in which applet is the regex and apple is the input text, reveals no match. The entire regex must match, and in this case the input text does not contain a t after apple.
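The three scenarios above can be reproduced without the command line. This sketch wraps RegexDemo's find() loop in a hypothetical report() helper (the class name LiteralMatchDemo is also made up for illustration) and returns the "Found" lines instead of printing them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralMatchDemo {
    // Collects one "Found ..." line per match, in RegexDemo's output format.
    static List<String> report(String regex, String input) {
        List<String> lines = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // end() is the index just past the match, so subtract 1
            // to report the index of the last matched character.
            lines.add("Found [" + m.group() + "] starting at " + m.start()
                      + " and ending at " + (m.end() - 1));
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(report("apple", "applet"));    // one match, 0 to 4
        System.out.println(report("apple", "crabapple")); // one match, 4 to 8
        System.out.println(report("applet", "apple"));    // empty list: no match
    }
}
```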

Metacharacters

More powerful regex constructs combine literal characters with metacharacters. For example, in a.b, the period metacharacter (.) represents any character that appears between a and b. Consider the following example:

java RegexDemo .ox "The quick brown fox jumps over the lazy ox."

This example specifies .ox as the regex and The quick brown fox jumps over the lazy ox. as the input text. RegexDemo searches the text for matches that begin with any character and end with ox. It produces the following output:

regex = .ox
input = The quick brown fox jumps over the lazy ox.
Found [fox] starting at 16 and ending at 18
Found [ ox] starting at 39 and ending at 41

The output reveals two matches: fox and ox (with the leading space character). The . metacharacter matches the f in the first match and the space character in the second match.

What happens when we replace .ox with the period metacharacter? That is, what output results from specifying the following command line:

java RegexDemo . "The quick brown fox jumps over the lazy ox."

Because the period metacharacter matches any character, RegexDemo outputs a match for each character (including the terminating period character) in the input text:

regex = .
input = The quick brown fox jumps over the lazy ox.
Found [T] starting at 0 and ending at 0
Found [h] starting at 1 and ending at 1
Found [e] starting at 2 and ending at 2
Found [ ] starting at 3 and ending at 3
Found [q] starting at 4 and ending at 4
Found [u] starting at 5 and ending at 5
Found [i] starting at 6 and ending at 6
Found [c] starting at 7 and ending at 7
Found [k] starting at 8 and ending at 8
Found [ ] starting at 9 and ending at 9
Found [b] starting at 10 and ending at 10
Found [r] starting at 11 and ending at 11
Found [o] starting at 12 and ending at 12
Found [w] starting at 13 and ending at 13
Found [n] starting at 14 and ending at 14
Found [ ] starting at 15 and ending at 15
Found [f] starting at 16 and ending at 16
Found [o] starting at 17 and ending at 17
Found [x] starting at 18 and ending at 18
Found [ ] starting at 19 and ending at 19
Found [j] starting at 20 and ending at 20
Found [u] starting at 21 and ending at 21
Found [m] starting at 22 and ending at 22
Found [p] starting at 23 and ending at 23
Found [s] starting at 24 and ending at 24
Found [ ] starting at 25 and ending at 25
Found [o] starting at 26 and ending at 26
Found [v] starting at 27 and ending at 27
Found [e] starting at 28 and ending at 28
Found [r] starting at 29 and ending at 29
Found [ ] starting at 30 and ending at 30
Found [t] starting at 31 and ending at 31
Found [h] starting at 32 and ending at 32
Found [e] starting at 33 and ending at 33
Found [ ] starting at 34 and ending at 34
Found [l] starting at 35 and ending at 35
Found [a] starting at 36 and ending at 36
Found [z] starting at 37 and ending at 37
Found [y] starting at 38 and ending at 38
Found [ ] starting at 39 and ending at 39
Found [o] starting at 40 and ending at 40
Found [x] starting at 41 and ending at 41
Found [.] starting at 42 and ending at 42
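Counting matches is a compact way to confirm the two runs above. The sketch below uses a hypothetical count() helper in a hypothetical DotDemo class. The last call goes slightly beyond the text: it shows that, with Pattern's default flags, the period does not match a new-line character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotDemo {
    // Counts the non-overlapping matches that find() locates in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy ox.";
        System.out.println(count(".ox", text)); // 2: "fox" and " ox"
        System.out.println(count(".", text));   // 43: one match per character
        System.out.println(count(".", "a\nb")); // 2: the new-line is skipped by default
    }
}
```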

Quoting metacharacters

To specify . or any metacharacter as a literal character in a regex construct, quote the metacharacter in one of the following ways:

  • Precede the metacharacter with a backslash character.
  • Place the metacharacter between \Q and \E (e.g., \Q.\E).

Remember to double each backslash character (as in \\. or \\Q.\\E) that appears in a string literal such as String regex = "\\.";. Don't double the backslash character when it appears as part of a command-line argument.
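Here is a short sketch of both quoting styles inside Java string literals. QuoteDemo and count() are hypothetical names. The call to Pattern.quote() is an addition not mentioned above: it is a standard Pattern method that returns a literal pattern string for the given text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteDemo {
    // Counts the matches that find() locates in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy ox.";
        System.out.println(count(".", text));                // 43: unquoted, matches everything
        System.out.println(count("\\.", text));              // 1: backslash-quoted period
        System.out.println(count("\\Q.\\E", text));          // 1: quoted between \Q and \E
        System.out.println(count(Pattern.quote("."), text)); // 1: quoted by the library
    }
}
```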

Character classes

We sometimes need to limit characters that will produce matches to a specific character set. For example, we might search text for vowels a, e, i, o, and u, where any occurrence of a vowel indicates a match. A character class identifies a set of characters between square-bracket metacharacters ([ ]), helping us accomplish this task. Pattern supports simple, negation, range, union, intersection, and subtraction character classes. We'll look at all of these below.

Simple character class

The simple character class consists of characters placed side by side and matches only those characters. For example, [abc] matches characters a, b, and c.

Consider the following example:

java RegexDemo [csw] cave

This example matches only c with its counterpart in cave, as shown in the following output:

regex = [csw]
input = cave
Found [c] starting at 0 and ending at 0

Negation character class

The negation character class begins with the ^ metacharacter and matches only those characters not located in that class. For example, [^abc] matches all characters except a, b, and c.

Consider this example:

java RegexDemo "[^csw]" cave

Note that the double quotes are necessary on my Windows platform, whose shell treats the ^ character as an escape character.

This example matches a, v, and e with their counterparts in cave, as shown here:

regex = [^csw]
input = cave
Found [a] starting at 1 and ending at 1
Found [v] starting at 2 and ending at 2
Found [e] starting at 3 and ending at 3
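The simple class and its negation split the input into two complementary sets of characters, which the following sketch makes visible. SimpleClassDemo and matched() are hypothetical names; the helper concatenates every match that find() reports.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleClassDemo {
    // Concatenates all matches found in the input, in order.
    static String matched(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(matched("[csw]", "cave"));  // c
        System.out.println(matched("[^csw]", "cave")); // ave
    }
}
```

No shell quoting is needed here because the regex is a Java string literal rather than a command-line argument.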

Range character class

The range character class consists of two characters separated by a hyphen metacharacter (-). All characters beginning with the character on the left of the hyphen and ending with the character on the right of the hyphen belong to the range. For example, [a-z] matches all lowercase alphabetic characters. It's equivalent to specifying [abcdefghijklmnopqrstuvwxyz].

Consider the following example:

java RegexDemo [a-c] clown

This example matches only c with its counterpart in clown, as shown:

regex = [a-c]
input = clown
Found [c] starting at 0 and ending at 0

Merging multiple ranges

You can merge multiple ranges into the same range character class by placing them side by side. For example, [a-zA-Z] matches all lowercase and uppercase alphabetic characters.
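As a rough illustration of single and merged ranges, the sketch below tests each character of the input against a precompiled range class and keeps the ones that belong. RangeDemo, keep(), and the two constants are hypothetical names, and the R2-D2 input is just a sample string.

```java
import java.util.regex.Pattern;

public class RangeDemo {
    // Compiled once and reused; Pattern objects are immutable.
    static final Pattern A_TO_C = Pattern.compile("[a-c]");
    static final Pattern LETTER = Pattern.compile("[a-zA-Z]");

    // Keeps only those characters of the input that belong to the class.
    static String keep(Pattern charClass, String input) {
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (charClass.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(keep(A_TO_C, "clown")); // c
        System.out.println(keep(LETTER, "R2-D2")); // RD
    }
}
```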

Union character class

The union character class consists of multiple nested character classes and matches all characters that belong to the resulting union. For example, [a-d[m-p]] matches characters a through d and m through p.

Consider the following example:

java RegexDemo [ab[c-e]] abcdef

This example matches a, b, c, d, and e with their counterparts in abcdef:

regex = [ab[c-e]]
input = abcdef
Found [a] starting at 0 and ending at 0
Found [b] starting at 1 and ending at 1
Found [c] starting at 2 and ending at 2
Found [d] starting at 3 and ending at 3
Found [e] starting at 4 and ending at 4
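One way to read a union is as shorthand for a single flat class. The sketch below checks that [ab[c-e]] and the flat range [a-e] produce the same matches for this input. UnionDemo and matches() are hypothetical names, and the comparison covers only this sample input, not the classes in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnionDemo {
    // Returns every match that find() locates, in order of appearance.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> union = matches("[ab[c-e]]", "abcdef");
        List<String> flat = matches("[a-e]", "abcdef");
        System.out.println(union);              // [a, b, c, d, e]
        System.out.println(union.equals(flat)); // true
    }
}
```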

Intersection character class

The intersection character class consists of characters common to all nested classes and matches only common characters. For example, [a-z&&[d-f]] matches characters d, e, and f.

Consider the following example:

java RegexDemo "[aeiouy&&[y]]" party

Note that the double quotes are necessary on my Windows platform, whose shell treats the & character as a command separator.

This example matches only y with its counterpart in party:

regex = [aeiouy&&[y]]
input = party
Found [y] starting at 4 and ending at 4
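The following sketch runs both intersection examples from this section. IntersectionDemo, firstIndex(), and matched() are hypothetical names. Notice that the a in party is skipped: it belongs to [aeiouy] but not to [y], so it falls outside the intersection.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IntersectionDemo {
    // Returns the index of the first match, or -1 if there is none.
    static int firstIndex(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    // Concatenates all matches found in the input, in order.
    static String matched(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstIndex("[aeiouy&&[y]]", "party")); // 4
        System.out.println(matched("[a-z&&[d-f]]", "abcdefg"));   // def
    }
}
```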