
[tor-commits] [metrics-lib/master] Optimize parsing large files with many descriptors.

To: tor-commits@xxxxxxxxxxxxxxxxxxxx
Subject: [tor-commits] [metrics-lib/master] Optimize parsing large files with many descriptors.
From: karsten@xxxxxxxxxxxxxx
Date: Fri, 11 Dec 2020 14:01:09 +0000 (UTC)
Patch-author: Karsten Loesing <karsten.loesing@xxxxxxx>

commit ff7e36c15626bdc24df54ebd94da5ab58f4de4c4
Author: Karsten Loesing <karsten.loesing@xxxxxxx>
Date:   Thu Dec 10 17:54:02 2020 +0100

    Optimize parsing large files with many descriptors.
    
    When parsing a large file with many descriptors we would repeatedly
    search the remaining file for the sequence "newline + keyword + space"
    and then "newline + keyword + newline" to find the start of the next
    descriptor. However, if the keyword is always followed by newline, the
    first search would always fail, and only after scanning the rest of
    the file.
    
    The optimization here is to check once, up front, whether the keyword
    is ever followed by space or by newline, and to skip searches that
    cannot succeed when going through the file.
    
    In the long term we should use a better parser. But in the short term
    this optimization will have a major impact on performance, in
    particular with regard to concatenated microdescriptors.
---
 CHANGELOG.md                                       |  3 +++
 .../descriptor/impl/DescriptorParserImpl.java      | 27 ++++++++++++++--------
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8ff5723..828718d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,9 @@
    - Parse version 3 onion service statistics contained in extra-info
      descriptors.
 
+ * Medium changes
+   - Optimize parsing of large files containing many descriptors.
+
 
 # Changes in version 2.14.0 - 2020-08-07
 
diff --git a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
index e008e7a..abe4411 100644
--- a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
+++ b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
@@ -181,16 +181,25 @@ public class DescriptorParserImpl implements DescriptorParser {
     String ascii = new String(rawDescriptorBytes, StandardCharsets.US_ASCII);
     boolean containsAnnotations = ascii.startsWith("@")
         || ascii.contains(NL + "@");
+    boolean containsKeywordSpace = ascii.startsWith(key.keyword + SP)
+        || ascii.contains(NL + key.keyword + SP);
+    boolean containsKeywordNewline = ascii.startsWith(key.keyword + NL)
+        || ascii.contains(NL + key.keyword + NL);
     while (startAnnotations < endAllDescriptors) {
-      int startDescriptor;
-      if (startAnnotations == ascii.indexOf(key.keyword + SP,
-          startAnnotations) || startAnnotations == ascii.indexOf(
-          key.keyword + NL)) {
+      int startDescriptor = -1;
+      if ((containsKeywordSpace
+          && startAnnotations == ascii.indexOf(key.keyword + SP,
+          startAnnotations))
+          || (containsKeywordNewline
+          && startAnnotations == ascii.indexOf(key.keyword + NL,
+          startAnnotations))) {
         startDescriptor = startAnnotations;
       } else {
-        startDescriptor = ascii.indexOf(NL + key.keyword + SP,
-            startAnnotations - 1);
-        if (startDescriptor < 0) {
+        if (containsKeywordSpace) {
+          startDescriptor = ascii.indexOf(NL + key.keyword + SP,
+              startAnnotations - 1);
+        }
+        if (startDescriptor < 0 && containsKeywordNewline) {
           startDescriptor = ascii.indexOf(NL + key.keyword + NL,
               startAnnotations - 1);
         }
@@ -204,10 +213,10 @@ public class DescriptorParserImpl implements DescriptorParser {
       if (containsAnnotations) {
         endDescriptor = ascii.indexOf(NL + "@", startDescriptor);
       }
-      if (endDescriptor < 0) {
+      if (endDescriptor < 0 && containsKeywordSpace) {
         endDescriptor = ascii.indexOf(NL + key.keyword + SP, startDescriptor);
       }
-      if (endDescriptor < 0) {
+      if (endDescriptor < 0 && containsKeywordNewline) {
         endDescriptor = ascii.indexOf(NL + key.keyword + NL, startDescriptor);
       }
       if (endDescriptor < 0) {
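
For readers who want to see the idea outside the diff, here is a minimal, self-contained sketch of the same precheck technique. It is not the metrics-lib code: the class name DescriptorSplitSketch and the methods split and findStart are hypothetical, annotations ("@" lines) are ignored, and when both separator forms occur it takes whichever match comes first, which differs from the patched method's space-first order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not part of metrics-lib. */
public class DescriptorSplitSketch {

  private static final String SP = " ";
  private static final String NL = "\n";

  /** Splits concatenated descriptors that each start with the keyword. */
  public static List<String> split(String ascii, String keyword) {
    // Scan the input once to learn which separator follows the keyword.
    boolean keywordSpace = ascii.startsWith(keyword + SP)
        || ascii.contains(NL + keyword + SP);
    boolean keywordNewline = ascii.startsWith(keyword + NL)
        || ascii.contains(NL + keyword + NL);
    List<String> descriptors = new ArrayList<>();
    int start;
    if (ascii.startsWith(keyword + SP) || ascii.startsWith(keyword + NL)) {
      start = 0;
    } else {
      start = findStart(ascii, keyword, 0, keywordSpace, keywordNewline);
    }
    while (start >= 0) {
      int next = findStart(ascii, keyword, start, keywordSpace,
          keywordNewline);
      int end = next < 0 ? ascii.length() : next;
      descriptors.add(ascii.substring(start, end));
      start = next;
    }
    return descriptors;
  }

  /** Returns the index of the next keyword at a line start, or -1. */
  private static int findStart(String ascii, String keyword, int from,
      boolean keywordSpace, boolean keywordNewline) {
    // Skip any search that the one-time scan proved cannot succeed;
    // a failing indexOf would otherwise walk to the end of the input.
    int bySpace = keywordSpace
        ? ascii.indexOf(NL + keyword + SP, from) : -1;
    int byNewline = keywordNewline
        ? ascii.indexOf(NL + keyword + NL, from) : -1;
    int hit;
    if (bySpace < 0) {
      hit = byNewline;
    } else if (byNewline < 0) {
      hit = bySpace;
    } else {
      hit = Math.min(bySpace, byNewline);
    }
    // The match begins at the newline; the descriptor starts after it.
    return hit < 0 ? -1 : hit + NL.length();
  }
}
```

Without the two flags, an input in which the keyword is always followed by newline would trigger one doomed "keyword + space" search per descriptor, each running to the end of the input. With them, that search is never issued.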


