Extension API Modification: Add length to TtsEvent
Issue description

Extension API Modification Proposal

API Namespace: tts
API Owners: Accessibility team

The following documents may not be necessary depending on the scope of your proposal:

API Overview Doc: https://developer.chrome.com/extensions/tts#type-EventType

Proposal: Add a new field to TtsEvent:
integer - length
The new field, length, represents the number of characters from the existing charIndex that comprise the next [word, sentence, utterance] depending on the event type. The existing charIndex would not be modified, but documentation would be updated to clarify that it is the character index of the current moment of time in the utterance.

Design Doc: N/A

Supplementary Resources: WIP change: https://chromium-review.googlesource.com/c/chromium/src/+/1385477

Adding the length will allow us to know exactly what word or phrase is being spoken, which is great for highlighting in accessibility features like Select-to-Speak on Chrome OS.
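A minimal sketch of how a consumer such as a highlighter might use the proposed field, assuming length is counted forward from charIndex as described above. The TtsEventLike interface and the highlightRange helper are hypothetical names for illustration and are not the real chrome.tts typings.

```typescript
// Hypothetical shape mirroring the proposed TtsEvent fields (assumption, not the real API typings).
interface TtsEventLike {
  type: "start" | "word" | "sentence" | "end";
  charIndex?: number;
  length?: number; // proposed field; assumed missing or negative when the engine does not report it
}

// Returns the [start, end) range to highlight, or null when the event carries no usable range.
function highlightRange(event: TtsEventLike, utterance: string): [number, number] | null {
  const { charIndex, length } = event;
  if (charIndex === undefined || length === undefined || charIndex < 0 || length < 0) {
    return null;
  }
  // Clamp to the utterance so a misreporting engine cannot push the range out of bounds.
  const start = Math.min(charIndex, utterance.length);
  const end = Math.min(start + length, utterance.length);
  return [start, end];
}

const utterance = "Hello brave new world";
const range = highlightRange({ type: "word", charIndex: 6, length: 5 }, utterance);
const spoken = range ? utterance.slice(range[0], range[1]) : "";
console.log(spoken); // prints "brave"
```

Returning null rather than a guessed range keeps the caller free to fall back to whatever heuristic it uses today when an engine does not supply a length.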
Dec 28
I'm not too opposed to this, and it seems like a simple enough change, but it might create a bit more work for authors. Would it make sense to instead pass the entire word or phrase being spoken, so that extensions don't have to look further for it? For example, we could pass the full phrase in the "start" event, and the relevant word or sentence in the "word" and "sentence" events.
Dec 28
I think passing the word/phrase as you suggest would be helpful as well! But I think indexes are still important, because the currently spoken word could appear more than once in the full phrase, e.g. in the sentence "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo." (https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo)

This is motivated by wanting to add precise word-level highlighting (in the style of a sing-along or karaoke video) to the Select-to-Speak feature in Chrome OS. Currently we use the existing TTS word event character index and have to make some assumptions to find the corresponding start of the word. If the TTS engine passed the start and end indexes at 'word' events, word highlighting would be really easy. If the TTS engine passed phrases as you suggest, Select-to-Speak would still need to convert those back to indexes.

I think adding the start character index is the smallest change to the API that will get the behavior needed in Select-to-Speak. However, doing something more complex is OK too. We could also roll out changes in phases: startCharIndex first, then some new options to pass the phrase at 'start' events and the word at 'word' events as well?
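To illustrate the duplicate-word concern, here is a small sketch comparing a text-only lookup with an index-based one on the sentence quoted above. The helper names findByText and findByIndex are hypothetical, and the start index 24 is simply the position of the fourth word in that string.

```typescript
const sentence = "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.";

// Text-only lookup: always returns the first occurrence, wherever the engine actually is.
function findByText(text: string, word: string): number {
  return text.indexOf(word);
}

// Index-based lookup: a (start, length) pair names exactly one occurrence.
function findByIndex(text: string, start: number, length: number): string {
  return text.slice(start, start + length);
}

// Suppose the engine is speaking the fourth word, which starts at index 24.
const fourthWordStart = 24;
const guessedStart = findByText(sentence, "buffalo");
const exactWord = findByIndex(sentence, fourthWordStart, 7);

console.log(guessedStart); // prints 8, the second word rather than the fourth
console.log(exactWord);    // prints "buffalo"
```

Given only the word text, a highlighter cannot tell which of the lowercase occurrences is being spoken, while an index and a length identify one occurrence without any search.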
Jan 2
Talked with Devlin about this over chat last week and we decided the best outcome would be to add a startCharIndex and an endCharIndex, and later to deprecate charIndex. This would allow each TtsEvent to specify the range over which it occurs: for example, a 'word' event could specify the indexes of the word which was spoken, and a 'sentence' event could specify the indexes of the sentence.
Here's how tts.json's TtsEvent type would change under this plan:
"charIndex": {
  "type": "integer",
  "optional": true,
  "deprecated": "charIndex will be deprecated in 2020 in favor of startCharIndex and endCharIndex.",
  "description": "The index of the current character in the utterance."
},
"startCharIndex": {
  "type": "integer",
  "optional": true,
  "description": "The starting index of the currently spoken word, sentence, or utterance, depending on the event type. It will be set to -1 if not set by the speech engine or not relevant to the event type."
},
"endCharIndex": {
  "type": "integer",
  "optional": true,
  "description": "The ending index of the currently spoken word, sentence, or utterance, depending on the event type. It will be set to -1 if not set by the speech engine or not relevant to the event type."
}
There is an open question that would need to be answered before proceeding in this way:
* When translating back to window.speechSynthesis utterance events, would the startCharIndex or the endCharIndex be used? The speech synthesis event only has a single charIndex: https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesisEvent
Once that is decided, we need to ensure we get TTS start and end character indexes from other platform engines, i.e. from Mac, Win, Linux and Android TTS engines. I think it will be possible:
2.a. Mac may have this information: It appears to have wordPos and wordLen (https://developer.apple.com/documentation/applicationservices/speechwordprocptr)
2.b. Win appears to also have this: in SPEI_WORD_BOUNDARY you can use lParam and wParam to get the character position and word length. (https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ee431845%28v%3dvs.85%29)
2.c. Chrome OS will be OK as we own that TTS engine.
2.d. Chrome TTS with Linux and with Android doesn't support word or sentence events, so there is no work needed here.
The scope of this work is somewhat larger with the need to update each platform as part of the charIndex deprecation. I'll probably write a design doc if we decide to move forward in this direction. In order to make that decision, I'll discuss with dtseng@ and dmazzoni@.
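The platform callbacks mentioned in 2.a and 2.b report a word position and a word length, so mapping them onto the startCharIndex / endCharIndex pair could look roughly like the sketch below. CharRange and toCharRange are hypothetical names, and treating endCharIndex as exclusive is an assumption, since the schema text above does not say whether the ending index is inclusive.

```typescript
// Hypothetical container for the two proposed fields.
interface CharRange {
  startCharIndex: number;
  endCharIndex: number;
}

// Sentinel taken from the schema descriptions: -1 when the engine does not set the value.
const UNSET = -1;

// Converts an engine-reported (position, length) pair into a start/end range.
// Assumption: endCharIndex is exclusive, i.e. the index just past the last character.
function toCharRange(wordPos: number | undefined, wordLen: number | undefined): CharRange {
  if (wordPos === undefined || wordLen === undefined || wordPos < 0 || wordLen < 0) {
    return { startCharIndex: UNSET, endCharIndex: UNSET };
  }
  return { startCharIndex: wordPos, endCharIndex: wordPos + wordLen };
}

const reported = toCharRange(8, 7);
const missing = toCharRange(undefined, 7);
console.log(reported.startCharIndex, reported.endCharIndex); // prints 8 15
console.log(missing.startCharIndex, missing.endCharIndex);   // prints -1 -1
```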
Jan 9
Updating the proposal: Rather than sending start and end indexes, we propose charIndex / length:
* charIndex: That's what's passed now, and is the index in the utterance at the timestamp that the event is received. No changes here.
* length: If it's a word event, it's the length of the word. If it's a start event, it's the length of the utterance. If it's an end event, it's the length of the remaining chars (0). If it's a sentence event, it's the length of a sentence.
@Devlin, what do you think? This removes the need to deprecate anything and is pretty clear across all event types.
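The per-event-type meaning of length can be sketched as a small function. This is only an illustration of the semantics listed above: expectedLength is a hypothetical helper, and the word and sentence boundary rules (a run of word characters, and text up to the next '.', '!' or '?') are simplifying assumptions rather than what any TTS engine actually does.

```typescript
type EventType = "start" | "word" | "sentence" | "end";

// Returns the length an event of the given type would carry under the proposal.
function expectedLength(type: EventType, text: string, charIndex: number): number {
  if (type === "start") {
    return text.length; // length of the whole utterance
  }
  if (type === "end") {
    return 0; // no characters remain
  }
  const rest = text.slice(charIndex);
  if (type === "word") {
    // Assumed word rule: a run of word characters starting at charIndex.
    const word = rest.match(/^\w+/);
    return word ? word[0].length : 0;
  }
  // Assumed sentence rule: everything up to and including the next terminator.
  const sentenceMatch = rest.match(/^[^.!?]*[.!?]?/);
  return sentenceMatch ? sentenceMatch[0].length : 0;
}

const sample = "Hi there. How are you?";
console.log(expectedLength("start", sample, 0));     // prints 22
console.log(expectedLength("word", sample, 3));      // prints 5 ("there")
console.log(expectedLength("sentence", sample, 10)); // prints 12 ("How are you?")
console.log(expectedLength("end", sample, 22));      // prints 0
```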
Jan 10
Jan 10
Jan 14
Discussed offline with dmazzoni@ and katie@ - charIndex + length SGTM. It's still non-deterministic, since charIndex could refer to either the beginning of a new word or the end of the previous word, and length could be similarly affected. Ideally, I would have much preferred something clearer, like startIndex and endIndex strictly defined as the bounds of the words/sentences, but that's not something we can get from the engines at this time. Flipping the API review bit. Thanks all for your patience!
Jan 19
Requesting other review bits. Thanks!
Comment 1 by karandeepb@chromium.org, Dec 28
Status: Assigned (was: Untriaged)