Re: [tor-dev] QUIC TOR Debugging Question (no attach)

Tim:

Sorry for not being specific enough in my questions. From now on I'll try to ask more detailed questions rather than describing higher-level problems.

Regarding the frequency of my emails, I apologize for the long intervals. I'm not full-time on this project, I've had exams, and I can often only work on the QUIC TOR project for a couple of days each week. Fortunately, I'm now nearly done with all my finals for this semester, so I can put more time into this project from now on.

Right now, I have two specific questions:

1. We just switched from chutney to testing on EmuLab, a testing framework where each node is a standalone machine. After the switch, a particular chutney bug disappeared: on chutney, some nodes used to crash mysteriously with no log output (the logs simply stop, with no stack trace or anything). This bug only occurred when there was an existing cache (the first run after chutney configure was fine). After porting to EmuLab with an almost identical torrc file, the bug disappeared and everything runs fine for now, so we are ignoring it at the moment. Have you seen similar issues on chutney?

2. The circuit building process is taking too long, and many circuits expire. We have 4 relays, 2 of which are also authorities. In the logs, I'm seeing a lot of the following lines:

circuit_expire_building(): Abandoning circ XX XXXX:XX:12345 (state 0,0: doing handshakes, purpose 5, len 3)
router_choose_random_node(): We found 3 running nodes.
router_choose_random_node(): We removed 1 excludednodes, leaving 2 nodes.
router_choose_random_node(): We removed 2 excludedsmartlist, leaving 0 nodes.

The first line appears when we have connected to the first node and are waiting for a response from the second or sometimes the third relay. The router_choose_random_node() lines appear when we are trying to choose the path to use for a circuit. What could I do to increase the number of available nodes? Should I increase the frequency of reachability tests?
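
For reference, the counting reported by the three router_choose_random_node() lines above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the log, not tor's selection code: the function name, the relay names, and the contents of the two exclusion lists are made up for illustration.

```python
def filter_candidates(running, excluded_nodes, excluded_smartlist):
    """Apply two exclusion passes and return the node count after each step.

    Hypothetical helper: it mirrors the counts printed in the log lines,
    nothing more.
    """
    after_nodes = [r for r in running if r not in excluded_nodes]
    after_smartlist = [r for r in after_nodes if r not in excluded_smartlist]
    return len(running), len(after_nodes), len(after_smartlist)


# Made-up relay names: 3 running nodes, 1 removed by the first pass,
# 2 removed by the second pass.
running = ["relay_b", "relay_c", "relay_d"]
counts = filter_candidates(running, {"relay_b"}, {"relay_c", "relay_d"})
print(counts)  # (3, 2, 0)
```

In a network this small every exclusion matters: once the two passes have removed three relays between them, there is nothing left to choose from.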

After looking at the code, I found a comment at circuitbuild.c line 2172 describing why some nodes are excluded, which I don't quite understand. Specifically, the comment says: "XXXX025 use the using_as_guard flag to accomplish this." Where can I find more information on this XXXX025 issue (committed here)? Why are these routers being excluded?

Please let me know if you want more specific information on those issues.

Thank you!

Li.

On Sun, Apr 24, 2016 at 11:33 PM, Tim Wilson-Brown - teor <teor2345@xxxxxxxxx> wrote:

> On 25 Apr 2016, at 06:44, Xiaofan Li <xli2@xxxxxxxxxxxxxx> wrote:
>
> Hi Tim and everyone on tor-dev,
>
> Our QUIC + TOR project has almost been fully implemented. We are debugging the last few bugs. Update:
> - We're now able to build many complete circuits with QUIC as the underlying protocol.
> - We have not debugged the actual communication part yet. We are aware of certain failure cases for QUIC (e.g. line 15642 of the log is being debugged right now), so we cannot send actual client data yet.
> - The current state uses QUIC for OR connections only. Thus a dual path is implemented, as suggested in my last email thread.
> - TLS is completely bypassed, and important state (that is set up in tls_handshake functions) is preserved and refactored out, e.g. conn->/chan->state purpose, etc.
> - Some tinkering with and redesigning of QUIC itself is also underway. The fact that QUIC is a transport protocol at the application layer makes it painful to interact with the event and timer systems of TOR. We are trying to improve this aspect now.
> The debugging log I was attaching was too big for the tor-dev list. If you are interested in taking a look at the file, let me know.

Large debug logs contain too much information to be helpful to you or to us.

Try warning, notice, or info level logs, in that order.
Using high-level logs makes it easier to work out where your attempts to send data have broken down.
Once you've identified where communication has broken down, try to fix it.

If you can't fix it, you're welcome to ask for advice.
Please quote a small number of relevant log messages, tell us what you think they mean, and what you've tried to do to fix it.
Also feel free to provide a link to logs at that level for people to look through.

This makes it more likely that people will recognise your issue and respond by helping you to fix it.

> Some particularly concerning things in the log:
> - circuit_get_by_circid_channel_impl(): found nothing for circ_id 14801, channel ID 2 (0x7f758bb6b740)
> Then it just attaches this circ onto this channel. Is this normal?
> - Line 4901 circuit_receive_relay_cell(): Passing on unrecognized cell.
> It happens a lot. Is this normal?
> - This sequence happened a lot around 7500.
> relay_send_command_from_edge_(): delivering 10 cell forward.
> circuit_package_relay_cell(): crypting a layer of the relay cell.
> circuit_package_relay_cell(): crypting a layer of the relay cell.
> circuit_package_relay_cell(): crypting a layer of the relay cell.
> It seems like it's decrypting and forwarding cells along. Is it normal for TOR (with TCP) to do this in a burst? I'm seeing about 1s of repeated calls.

I honestly don't think these are concerning at all. But I don't really know.
And I can't find out, because I don't know which version of tor you've based your changes on.

Here's how you can find out whether these log messages are typical or not:

Run the original version of Tor that you've based your QUIC changes on, with the same network configuration.
(Does it work? If not, your QUIC network will likely never work either.)
Then compare the warning, notice, and info logs to tor with QUIC.
Stop at the first log that differs in non-trivial ways.
This is a log level that's useful for you.
(High-level logs will also cause you less concern about spurious messages.)

This way, you can answer your own questions about which logs and behaviours are normal, and which ones you've introduced.
Feel free to report back with any log messages from the unmodified version of Tor that might indicate bugs.
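
One rough way to automate this comparison is sketched below. It assumes the usual tor log line layout (month, day, time, bracketed severity, message), masks numbers and pointers so that circuit IDs and ports don't count as differences, and reports the first position where the two runs diverge. The function names are made up for this sketch. Real logs interleave events in slightly different orders from run to run, so treat the result as a starting point rather than a verdict.

```python
import re
from itertools import zip_longest

# Assumed layout: "Apr 24 23:33:01.000 [notice] message text"
LOG_LINE = re.compile(
    r"^\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}\.\d+ \[(\w+)\] (.*)$"
)
# Pointers and numbers change between runs, so mask them before comparing.
VOLATILE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def normalize(line):
    """Return (severity, masked message), or None if the line doesn't parse."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    severity, message = match.groups()
    return severity, VOLATILE.sub("N", message)


def first_divergence(baseline_lines, modified_lines):
    """Return (index, baseline_entry, modified_entry) at the first mismatch.

    Returns None when the normalized logs are identical. A missing entry
    (one log is shorter than the other) is reported as None.
    """
    base = [n for n in map(normalize, baseline_lines) if n is not None]
    mod = [n for n in map(normalize, modified_lines) if n is not None]
    for index, (b, m) in enumerate(zip_longest(base, mod)):
        if b != m:
            return index, b, m
    return None
```

Typical use would be to read the notice-level log of the unmodified tor and of the QUIC build into two lists of lines, call first_divergence() on them, and only drop to info level if notice shows no difference. Note that masking every number also hides differences such as a changed bootstrap percentage.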

> Some more general questions:
> - Internal Circuits: any docs? What is it used for? Measuring bandwidth?

Relay bandwidth testing, relay reachability testing (default chutney configs skip this using AssumeReachable), client directory fetches, hidden service directory document uploads, onion services (hidden services), ...

Read the ~12 instances of CIRCLAUNCH_IS_INTERNAL in the tor source code for more details.

> How many internal circuits are required by the system?

As many as are necessary to support the operation of the Tor client / relay / onion service at the current time.
Initially, 2 or 3 (read circuit_predict_and_launch_new for more details).

> - Circuit wide ID format. We had a bug regarding this last week. The check in process_create_cell always fails because the check at lines 281-295 in command.c (for CIRC_ID_TYPE and id_is_high) always fails. We have currently commented out this check. What does it affect? And can we do this?

I can't see how this could be your client communication issue. It's only an issue if the circuit IDs collide, which should be unlikely in small networks.

When two relays create circuits on a connection, one uses the lower half of the circuit id space, and one uses the upper half. This prevents circuit IDs colliding. Read the definitions of circ_id_type, circ_id_type_t, and channel_set_circid_type for details.

The version of the link protocol determines how this decision is made.
I assume that your tor has chan->conn->link_proto >= MIN_LINK_PROTO_FOR_WIDE_CIRC_IDS.
(You can check this by printing out the value of chan->conn->link_proto everywhere channel_set_circid_type is called.)

So you've removed TLS client identity and TLS server identity keys.
What do get_tlsclient_identity_key and get_server_identity_key return?
Null bytes?

Is there a publicly known key in QUIC that's known by both sides and stable for the life of a connection?
If so, use that.

If not, always pass 0 for consider_identity to channel_set_circid_type, so that the initiator uses the upper half of the circuit IDs, regardless of keys.
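
The split of the circuit ID space described above can be sketched in a few lines of Python. This is an illustration of the idea, not a transcription of tor's C code: the function names and the "higher"/"lower" strings are made up, and the sketch assumes the half is selected by the top bit of the ID (bit 31 for wide 4-byte IDs, bit 15 for 2-byte IDs).

```python
def high_bit(wide_circ_ids=True):
    """Top bit of the circuit ID (assumed: 4-byte IDs when wide, else 2-byte)."""
    return 1 << 31 if wide_circ_ids else 1 << 15


def id_is_high(circ_id, wide_circ_ids=True):
    """True if circ_id falls in the upper half of the ID space."""
    return bool(circ_id & high_bit(wide_circ_ids))


def circid_half_without_identity(started_here):
    """Fallback when no identity keys are compared: the initiator takes the
    upper half, the responder takes the lower half."""
    return "higher" if started_here else "lower"


def create_cell_id_acceptable(circ_id, our_half, wide_circ_ids=True):
    """A peer must pick new circuit IDs from its own half, not from ours.

    our_half is "higher" or "lower" (hypothetical labels for this sketch).
    """
    if our_half not in ("higher", "lower"):
        raise ValueError("our_half must be 'higher' or 'lower'")
    peer_uses_high = our_half == "lower"
    return id_is_high(circ_id, wide_circ_ids) == peer_uses_high
```

If both ends of a connection conclude that they own the same half, every incoming create cell fails a check of this kind, which would match the symptom of a check that always fails.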

Breaking other parts of the circuit management code could also cause this issue.

> - From a high level, when a client sends data using a circuit, what is its code path? Which special (as in, specific to client-initiated communication) functions are called?

I'm not sure how to answer this question. The unhelpful but accurate answer is "not many codepaths are client-specific, if there are any at all".

Regardless of its role in the network, every tor instance performs common operations like retrieving consensus documents and building circuits. And, if configured to do so, tor instances can perform multiple roles.

Here are some high-level differences between client and server communication in the tor network that could be causing your issues:

Typically, clients, onion services, and bridges retrieve directory documents using "begindir", a TLS connection to the ORPort. Relays and authorities do this unencrypted over the DirPort. If you haven't replaced TLS with QUIC correctly, clients may fail to bootstrap or retrieve directory documents. There should be log messages about this.

Clients have a SOCKSPort open, and in response to application requests they make an AP (application) connection that's linked to a stream on a circuit that's been extended to the destination exit relay. They then send requests received on the SOCKSPort to the destination relay, and receive responses that they forward to the application. (The onion service setup is slightly more complex, but transmits data in a similar way.)
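
To get a repeatable application connection for testing this path, a small SOCKS5 probe can stand in for a browser. The sketch below builds the two SOCKS5 client messages by hand (greeting with "no authentication", then a CONNECT request carrying a hostname) and reports the proxy's reply code. The function names are made up, and the SOCKSPort address is whatever your client's torrc configures; nothing here is specific to tor beyond speaking SOCKS5 to it.

```python
import socket
import struct


def socks5_greeting():
    """Version 5, one method offered, method 0 (no authentication)."""
    return bytes([0x05, 0x01, 0x00])


def socks5_connect_request(hostname, port):
    """CONNECT request with a domain-name address, so the proxy resolves it."""
    host = hostname.encode("ascii")
    if not 0 < len(host) < 256:
        raise ValueError("hostname must be 1 to 255 bytes")
    # VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3 (domain name), length, name, port
    header = bytes([0x05, 0x01, 0x00, 0x03, len(host)])
    return header + host + struct.pack(">H", port)


def probe_socks_port(socks_addr, hostname, port, timeout=30.0):
    """Try one connection through the proxy and describe what happened.

    socks_addr is a (host, port) tuple for the client's SOCKSPort.
    """
    with socket.create_connection(socks_addr, timeout=timeout) as sock:
        sock.sendall(socks5_greeting())
        if sock.recv(2) != b"\x05\x00":
            return "proxy refused the no-authentication method"
        sock.sendall(socks5_connect_request(hostname, port))
        reply = sock.recv(10)
        if len(reply) < 2:
            return "proxy closed the connection"
        return "connected" if reply[1] == 0 else f"SOCKS error code {reply[1]}"
```

Running probe_socks_port() against the client while watching its notice-level log gives one fixed event to look for in the logs of each hop.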

Have you read torguts?
https://gitweb.torproject.org/user/nickm/torguts.git/

Any part of this process could break and cause client communication to fail.
Parts of the relay code could also break in ways that cause client communication to fail.

I can't see how to describe specific code paths without more specific (and precise) detail about what's failing, and whether it's failing on clients or relays. You can find this in the logs, if you log sensibly. Let us know what you find, and what you tried to do to fix it.

What high-level success or failure message (warning, notice, info) is logged on the client right after you try to make an application connection?
Does the connection reach a relay? The exit? DNS? The remote site?
What warning, notice, or info-level message is logged on the last tor node where the connection stops working?
(Or what DNS or HTTP request is sent to the remote server / site?)

> Any other comment on the log is greatly appreciated, since everyone here is probably more familiar than me with what a normal bootstrapping process would look like.

Don't worry too much about the log messages. They're designed to be used for debugging once there is a known issue.
The vast majority are harmless, and many need context to interpret. You can find this context by searching the tor code for unique words or phrases in the log message. (But keep in mind that log strings are often composed of shorter strings.)
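
A small helper can do this search for you. The sketch below splits a log message at numbers and punctuation (the parts most likely to come from format arguments), then searches the .c files under a source directory for the longest remaining fragment that appears anywhere. The function names are made up, and the splitting rule is a guess at which parts of a message are literal text.

```python
import re
from pathlib import Path


def literal_fragments(message, min_length=6):
    """Split a log message into fragments likely to be literal source text,
    longest first."""
    pieces = {p.strip() for p in re.split(r"[\d:().,]+", message)}
    usable = [p for p in pieces if len(p) >= min_length]
    return sorted(usable, key=lambda p: (-len(p), p))


def find_log_source(message, source_root):
    """Return (fragment, [(file name, line number), ...]) for the longest
    fragment found in any .c file under source_root, or None."""
    files = sorted(Path(source_root).rglob("*.c"))
    for fragment in literal_fragments(message):
        hits = []
        for path in files:
            text = path.read_text(errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if fragment in line:
                    hits.append((path.name, number))
        if hits:
            return fragment, hits
    return None
```

Because fragments are tried longest first, a phrase that was substituted in at run time (and so never appears in the source) is skipped in favour of the next fragment that does appear.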

Some general requests for future questions:

It would be much easier and faster for me (and perhaps others) to help you if you asked questions after trying to identify and fix issues yourself. I encourage you to try some of the things I've suggested, and ask more precise questions next time.

Personally, I would find it easier to respond to targeted questions that come one at a time, every few days or every week, rather than a large email every few weeks.

It might also be helpful to be able to see the source code you're working on, rather than trying to guess what changes you've made from what I remember of what you said about your design in previous emails.

Tim

Tim Wilson-Brown (teor)

teor2345 at gmail dot com
PGP 968F094B
ricochet:ekmygaiu4rzgsk6n