|
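GuanGuan Collector (關關采集器) scrapes mainly with regular expressions. The expressions used most often in its rules are:
\d* matches a run of digits (possibly empty)
\s* matches spaces and line breaks (possibly none)
.+? matches one or more characters, as few as possible (cannot be empty)
.* matches any characters (may be empty)
() encloses the part we want to capture
((.|\n)*) captures the chapter content, line breaks included
=====Jieqi (杰奇) equivalents=====
!!!! is equivalent to ([^><]*)
~~~~ is equivalent to ([^><'"]*)
^^^^ is equivalent to ([^><\d]*)
$$$$ is equivalent to ([\d]*)
**** is equivalent to (.*)
=====Other basics=====
. matches any single character. For example, the regular expression r.t matches the strings rat, rut, and r t, but not root.
$ matches the end of a line. For example, weasel$ matches the end of the string "He's a weasel", but not the string "They are a bunch of weasels."
^ matches the start of a line. For example, ^When in matches the start of the string "When in the course of human events", but not "What and When in the".
* matches zero or more of the character immediately before it. For example, .* matches any number of any characters.
\ is the escape character. It makes the metacharacters listed here match as ordinary characters. For example, \$ matches a dollar sign rather than the end of a line, and likewise \. matches a literal dot rather than acting as a wildcard for any character.
Universal image rule: <[^<]*((?<=<(?:img|IMG)[^>]*(?:(?:src|SRC)(?:\s*=\s*(?:["']?))))(?:[^\s"'>]*)\.(?:jpg|gif|jpeg|bmp|png|GIF|JPG))[^>]*>
Attached: the collection rules for 藏海閣文學網 (all text).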
<RuleConfigInfo xmlns:xsi="http://www./2001/XMLSchema-instance" xmlns:xsd="http://www./2001/XMLSchema">
  <RuleVersion> <RegexName /> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </RuleVersion>
  <RuleID> <RegexName>RuleID</RegexName> <Pattern>1</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </RuleID>
  <GetSiteName> <RegexName>GetSiteName</RegexName> <Pattern>藏海閣</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </GetSiteName>
  <GetSiteCharset> <RegexName>GetSiteCharset</RegexName> <Pattern>utf-8</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </GetSiteCharset>
  <GetSiteUrl> <RegexName>GetSiteUrl</RegexName> <Pattern>http://www./</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </GetSiteUrl>
  <NovelSearchUrl> <RegexName>NovelSearchUrl</RegexName> <Pattern>http://www./Book/Search.aspx</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelSearchUrl>
  <NovelSearchData> <RegexName>NovelSearchData</RegexName> <Pattern>SearchKey={SearchKey}&amp;SearchClass=1</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelSearchData>
  <NovelSearch_GetNovelKey> <RegexName>NovelSearch_GetNovelKey</RegexName> <Pattern>&lt;div id="CListTitle"&gt;&lt;a href="/Book/(\d*)/Index.aspx" target="_blank"&gt;&lt;b&gt;{SearchKey}&lt;/b&gt;&lt;/a&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelSearch_GetNovelKey>
  <NovelListUrl> <RegexName>NovelListUrl</RegexName> <Pattern>http://www./type/1/</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelListUrl>
  <NovelList_GetNovelKey> <RegexName>NovelList_GetNovelKey</RegexName> <Pattern>&lt;a href="http://www./books/(\d*)/" id=".+?" title=".+?"&gt;(.+?)&lt;/a&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelList_GetNovelKey>
  <NovelUrl> <RegexName>NovelUrl</RegexName> <Pattern>http://www./books/{NovelKey}/</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelUrl>
  <NovelErr> <RegexName>NovelErr</RegexName> <Pattern>未找到該編號的書籍信息</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelErr>
  <NovelName> <RegexName>NovelName</RegexName> <Pattern>&lt;h1&gt;(.+?)&lt;/h1&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelName>
  <NovelAuthor> <RegexName>NovelAuthor</RegexName> <Pattern>作者:(.+?)&lt;/span&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelAuthor>
  <LagerSort> <RegexName>LagerSort</RegexName> <Pattern>書籍類別:(.+?)&lt;/span&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </LagerSort>
  <SmallSort> <RegexName>SmallSort</RegexName> <Pattern>書籍類別:(.+?)&lt;/span&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </SmallSort>
  <NovelIntro> <RegexName>NovelIntro</RegexName> <Pattern>&lt;div&gt;內容簡介:((.|\n)*?)&lt;/div&gt;\s*&lt;/li&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern>&lt;span(.|\n)+?&lt;/span&gt;|&lt;p&gt;|&lt;a.+?&lt;/a&gt;|&lt;/div&gt;</FilterPattern> </NovelIntro>
  <NovelKeyword> <RegexName>NovelKeyword</RegexName> <Pattern>&lt;h1&gt;(.+?)&lt;/h1&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelKeyword>
  <NovelDegree> <RegexName>NovelDegree</RegexName> <Pattern>連載狀態:(.+?)&lt;/span&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelDegree>
  <NovelCover> <RegexName>NovelCover</RegexName> <Pattern>&lt;a class="pic"&gt;&lt;img src="(.+?)"</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelCover>
  <NovelDefaultCoverUrl> <RegexName>NovelDefaultCoverUrl</RegexName> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelDefaultCoverUrl>
  <NovelInfo_GetNovelPubKey> <RegexName>NovelInfo_GetNovelPubKey</RegexName> <Pattern>連載狀態:(.+?)&lt;/span&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </NovelInfo_GetNovelPubKey>
  <PubCookies> <RegexName>PubCookies</RegexName> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubCookies>
  <PubIndexUrl> <RegexName>PubIndexUrl</RegexName> <Pattern>http://www./books/{NovelKey}/</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubIndexUrl>
  <PubIndexErr> <RegexName>PubIndexErr</RegexName> <Pattern>這里必須填寫</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubIndexErr>
  <PubVolumeContent> <RegexName>PubVolumeContent</RegexName> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubVolumeContent>
  <PubVolumeSplit> <RegexName>PubVolumeSplit</RegexName> <Pattern>&lt;h3&gt;</Pattern> <Method>Spilt</Method> <Options>None</Options> <FilterPattern /> </PubVolumeSplit>
  <PubVolumeName> <RegexName>PubVolumeName</RegexName> <Pattern>Title"&gt;(.+?)&lt;/div&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern> </FilterPattern> </PubVolumeName>
  <PubChapterName> <RegexName>PubChapterName</RegexName> <Pattern>&lt;li&gt;&lt;a href=" http://www./book/\d*/\d*/"&gt;([^&lt;]+?)&lt;/a&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubChapterName>
  <PubChapter_GetChapterKey> <RegexName>PubChapter_GetChapterKey</RegexName> <Pattern>&lt;li&gt;&lt;a href="( http://www./book/\d*/\d*/)"&gt;[^&lt;]+?&lt;/a&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubChapter_GetChapterKey>
  <PubContentUrl> <RegexName>PubContentUrl</RegexName> <Pattern>{ChapterKey}</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubContentUrl>
  <PubContentErr> <RegexName>PubContentErr</RegexName> <Pattern>這里必須填寫</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubContentErr>
  <PubContent_GetTextKey> <RegexName>PubContent_GetTextKey</RegexName> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubContent_GetTextKey>
  <PubTextUrl> <RegexName>PubTextUrl</RegexName> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubTextUrl>
  <PubContentText> <RegexName>PubContentText</RegexName> <Pattern>&lt;div id="zjneirong" style="font-size:14px;width:100%;"&gt;((.|\n)+?)&lt;hr</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern>&lt;div.+?&gt;|&lt;div&gt;|&lt;/div&gt;|&lt;DIV.+?&gt;|&lt;/DIV&gt;|&lt;script(.|\n)+?&lt;/script&gt;|&lt;style(.|\n)+?&lt;/style&gt;|&lt;a(.|\n)+?&lt;/a&gt;</FilterPattern> </PubContentText>
  <PubContentReplace> <RegexName>PubContentReplace</RegexName> <Pattern /> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubContentReplace>
  <PubContentImages> <RegexName>PubContentImages</RegexName> <Pattern>&lt;[^&lt;]*((?&lt;=&lt;(?:img|IMG)[^&gt;]*(?:(?:src|SRC)(?:\s*=\s*(?:["']?))))(?:[^\s"'&gt;]*)\.(?:jpg|gif|jpeg|bmp|png|GIF|JPG))[^&gt;]*&gt;</Pattern> <Method>Match</Method> <Options>None</Options> <FilterPattern /> </PubContentImages>
</RuleConfigInfo> |
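To see what a Pattern and its FilterPattern do together, the sketch below applies two of the rules above to a small page. It is only an illustration in Python, not the collector's own code: the helper `apply_rule`, the sample HTML, and the assumption that the collector keeps the first capture group and then deletes every FilterPattern match are all hypothetical. The pattern strings are the NovelName and PubContentText rules written as plain regular expressions (any XML entity escaping undone).

```python
import re
from typing import Optional

def apply_rule(html: str, pattern: str, filter_pattern: str = "") -> Optional[str]:
    """Return the first capture group of `pattern`; delete `filter_pattern` matches from it.

    Assumed behaviour for illustration only, not the collector's documented logic.
    """
    match = re.search(pattern, html)
    if match is None:
        return None
    text = match.group(1)
    if filter_pattern:
        text = re.sub(filter_pattern, "", text)
    return text

# Rules taken from the attached rule file.
NOVEL_NAME = r'<h1>(.+?)</h1>'
CONTENT = r'<div id="zjneirong" style="font-size:14px;width:100%;">((.|\n)+?)<hr'
CONTENT_FILTER = (
    r'<div.+?>|<div>|</div>|<DIV.+?>|</DIV>|<script(.|\n)+?</script>'
    r'|<style(.|\n)+?</style>|<a(.|\n)+?</a>'
)

# Hypothetical page, shaped only to fit the two rules.
SAMPLE_PAGE = (
    '<h1>Sample Novel</h1>\n'
    '<div id="zjneirong" style="font-size:14px;width:100%;">'
    'First line.\n<a href="/ad">ad</a>Second line.<hr />'
)

name = apply_rule(SAMPLE_PAGE, NOVEL_NAME)
content = apply_rule(SAMPLE_PAGE, CONTENT, CONTENT_FILTER)
print(name)
print(content)
```

The PubContentImages rule is left out on purpose: its look-behind contains `[^>]*`, which is variable-width, and Python's built-in `re` module only accepts fixed-width look-behind, so that pattern needs an engine that supports it.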
|
|
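The Jieqi (杰奇) placeholder table near the top can be read as a simple substitution: each four-character placeholder stands for one capture group. A minimal Python sketch of that substitution follows. The function name `jieqi_to_regex`, the sample rule, and the sample HTML are hypothetical, and treating the text between placeholders as literal (escaped) is an assumption about how such rules are meant to be read.

```python
import re

# Placeholder-to-regex table copied from the Jieqi equivalents list.
JIEQI_TO_REGEX = {
    "!!!!": r"([^><]*)",
    "~~~~": r"([^><'\"]*)",
    "^^^^": r"([^><\d]*)",
    "$$$$": r"([\d]*)",
    "****": r"(.*)",
}

def jieqi_to_regex(rule: str) -> str:
    """Expand Jieqi placeholders and escape the literal text around them."""
    # The capture group in the splitter keeps the placeholders in the result list.
    splitter = "(" + "|".join(re.escape(p) for p in JIEQI_TO_REGEX) + ")"
    parts = re.split(splitter, rule)
    return "".join(
        JIEQI_TO_REGEX[part] if part in JIEQI_TO_REGEX else re.escape(part)
        for part in parts
    )

# Hypothetical rule: capture a numeric book id and the link text.
pattern = jieqi_to_regex('<a href="/Book/$$$$/Index.aspx">!!!!</a>')
match = re.search(pattern, '<li><a href="/Book/123/Index.aspx">Sample Title</a></li>')
book_id, title = match.groups()
print(book_id, title)
```

Going the other way, a GuanGuan rule that only uses these five groups can be rewritten by hand into placeholder form by replacing each group with its four-character marker.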